[Bug 58896] New: autoSizeColumn() is too slow

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[Bug 58896] New: autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

            Bug ID: 58896
           Summary: autoSizeColumn() is too slow
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HSSF
          Assignee: [hidden email]
          Reporter: [hidden email]

So, it goes without saying that a lot of people have complained about the
performance of autoSizeColumn(). I followed all the online advise about only
invoking this method once per column once the worksheet is fully populated, but
it is still extremely slow (30 seconds in POI, 1 second using the Excel GUI).

The worksheet I am formatting has 139 columns and 160 rows.

Is there an existing task for optimizing this function? Has anyone taken a
profiler to it and investigated if there are is any low-hanging fruit?

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #1 from Javen O'Neal <[hidden email]> ---
I've looked through the code 6 months back because I had the same observation.

My conclusion was that nothing stood out as wastefully inefficient. The
contents of every cell (formulas evaluated as necessary) in a column was
rendered in RichText, and a minimum bounding box around the RichText was
calculated. This was done exactly once per non-blank cell in the column.

To make this faster, while auto-sizing a column, we could cache cells with the
same value and formatting to avoid re-calculating the RichText box, but this
comes at the cost of memory and a slight performance hit when the majority of
cell values in a column are unique. Alternatively, the workbook could cache the
column widths when cell values are modified, which would result in a net
increase in memory consumption and time, and changing any formula would
invalidate the calculated column widths. Excel could get away with the latter
approach since the marginal cost to auto-sizing one cell is negligible when
using a GUI. However, I don't know if this is what Excel does.

Feel free to step through the autosize code yourself to see if you can find
something I couldn't. More eyes are better.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #2 from Dominik Stadler <[hidden email]> ---
Analysis showed no apparent general suitable way to increase performance
significantly without incurring higher memory requirements (which we would like
to avoid).

As your case seems odd with a very high runtime with only few rows/columns (in
other places we do autosizing on tens of thousands of rows without much delay),
so it would be good if you can do some profiling to find which parts of the
code are causing this.

Alternatively please attach both a sample file and sample code to reproduce the
problem so somebody else can look at your specific code.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #3 from Javen O'Neal <[hidden email]> ---
To test out the times, I wrote a unit test (r1730997) that creates a file with
the following contents:
160 rows, 139 columns, with the content of each cell as "Cell[r=159, c=138]"
using the workbook default style. The time required to calculate the best fit
width is a function of the number of characters and number of styles/fonts.
Likely due to the JVM's JIT compiler, the time spent calculating the best fit
width for each column decreased over time. With the simplicity of this test,
this is obviously a best case scenario for auto-sizing (hence the reason I
didn't get 30 seconds). I have personally seen the 30 second turtle pace
described by Gili with more complicated spreadsheets (numbers, number
formatting, styles)

                              [1]    [2]    [3]
                             HSSF   XSSF   SXSSF
            Populate sheet:  21ms  344ms    9ms
     Autosize first column:  36ms    4ms   10ms
Autosize first ten columns: 217ms   39ms   84ms
      Autosize all columns: 776ms  447ms  674ms
 Total (populate+autosize): 797ms  791ms  683ms

There isn't a significant difference in the amount of time required to
auto-size columns for each spreadsheet implementation.

[1] http://pasted.co/af5f61a5
[2] http://pasted.co/5e0cd6700
[3] http://pasted.co/41a51c4e

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #4 from Gili <[hidden email]> ---
Created attachment 33635
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33635&action=edit
Input file for testcase

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #5 from Gili <[hidden email]> ---
Created attachment 33636
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33636&action=edit
Testcase

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Gili <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from Gili <[hidden email]> ---
I've attached a testcase for your consideration. On my machine, auto-sizing
columns takes 2 minutes, 12 seconds, 489 milliseconds.

Please let me know if you can reproduce the problem on your end.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #7 from Gili <[hidden email]> ---
Javen and Dominik,

Have you tried running the testcase I attached? I'm wondering if it revealed
anything interesting...

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #8 from Javen O'Neal <[hidden email]> ---
Your input file has a few things beyond my unit test from comment 3:
1) Rich Text Formatting (bold, color)
2) Number formatting ($128,000)
3) Merged cells (Description)

How much do each of these contribute to the 132.5 seconds? Could you time your
auto-size code from comment 5 using an Excel file with:
A) no formatting
B) 1+2+3 formatting
C) 1, 2, and 3 formatting, separately, if A and B are different

As a side-note: if you are using POI to auto-size a file similar to attachment
33635 and you know the workbook structure, you could either manually determine
ideal column widths and use POI to explicitly set the widths, or have POI
calculate the best-fit width of the first group of columns, then apply those
widths to all of the columns. This obviously doesn't work when the content is
less structured or there are longer strings.

Andi suggested that some people could get away with an approximate solution:
> cell width = length(cell value expressed as a string) * width of a typical character in the cell's font.
If we wanted to make this work for mixed-font (RichText) cells, perform the
same computation on each run of characters that have the same font size and
font name. A cheaper version is to assume all fonts have the same character
widths.
> cell width = sum of [(font size)*(average character width at 1pt font size)*(number of characters in a same-size font run)] for each font run.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is too slow

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #9 from Gili <[hidden email]> ---
Javen,

You hit the nail on the head. Cell formatting has no effect on autoSizeColumn()
but merged cells do.

It doesn't matter whether I invoke autoSizeColumn() with useMergedCells set to
true or false, the performance problem occurs either way. However, if I
un-merge all cells in the worksheet before invoking autoSizeColumn() the
performance problem disappears.

How do we proceed from here?

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Gili <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|autoSizeColumn() is too     |autoSizeColumn() is
                   |slow                        |extremely slow if worksheet
                   |                            |contains merged cells

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #10 from Javen O'Neal <[hidden email]> ---
I wonder how many times the best-fit width of a cell is being calculated in a
merged cell. If more than once per cell, can we cache those calculations or
rewrite the code so unnecessary work isn't performed?
Are we checking if the cell is blank before proceeding to calculate the
best-fit width?
Does useMergedCells=False/True make a difference on speed?

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|HSSF                        |SS Common

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |52834

--- Comment #11 from Javen O'Neal <[hidden email]> ---
Probably the same bug as bug 52834

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #12 from Axel Howind <[hidden email]> ---
I remember observing very bad performance in handling merged cells starting in
POI 3.14(?). IIRC it was not only with autoSizeColumn(), but whenever trying to
query merged regions of a sheet. It could even be that it was just
getNbOfMergedRegions() in XSSF that was so horribly slow. I coded around this
issue.

Maybe I will dig into this again when I find the time.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |60417


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=60417
[Bug 60417] autoSizeColumn(int i) swallows interrupted exception and resets
interrupted flag
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #13 from Manuel Koller <[hidden email]> ---
I was tracking down a slow Excel export and I came across this ticket. There
seems to be at least one optimisation that could help speed up things.

Profiling one very slow export, I noticed that about 60% of the time was spent
calling sheet.getMergedRegions() from SheetUtil.getCellWidth(Cell cell, int
defaultCharWidth, DataFormatter formatter, boolean useMergedCells).

The getCellWidth method is called for each cell in the sheet. Presumably the
return value of sheet.getMergedRegions() doesn't change during an
autoSizeColumn() call. If the return value from getMergedRegions() is either
passed in or otherwise cached, then this would probably speed things up
considerably.

The passing in approach is easy but would probably clutter the interface too
much. But caching the sheet.getMergedRegions() within sheet instances seems
more complicated as presumably each implementation would have to handle that
separately. As the calls are likely to be made for each column in the sheet,
the solution should cover that case, too.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #14 from Vladimir <[hidden email]> ---
I also have a problem with autoSizeColumn with small number of columns and
data.
I have an application which generates XLSX files in different threads.
I have changed the POI version from 3.17 to 4.1.0 and the application became to
hang on the autoSizeColumn(sheet, true) line.
When I had removed this line or had changed to sheet.setColumnWidth(i,
SMALL_COLUMN_WIDTH) the problem was resolved.

There aren't excpetions in the log file!


-----Versions-----

1) OS: Windows 10

2) Java
java version "11.0.3" 2019-04-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.3+12-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.3+12-LTS, mixed mode)

3)
POI version: 4.1.0

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58896] autoSizeColumn() is extremely slow if worksheet contains merged cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58896

--- Comment #15 from Cody Lerum <[hidden email]> ---
For those hitting this with Java 11 & Java 12 see:
https://bz.apache.org/bugzilla/show_bug.cgi?id=63482

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12