[Bug 58740] New: [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug 58740] New: [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

            Bug ID: 58740
           Summary: [PATCH] Fix O(n^2) behavior when generating XLSX files
                    with large number of styles
           Product: POI
           Version: 3.13-FINAL
          Hardware: PC
                OS: Mac OS X 10.1
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSSF
          Assignee: [hidden email]
          Reporter: [hidden email]

Created attachment 33352
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33352&action=edit
Patch that fixes the behavior described

When generating XSLX files for a spreadsheet that contains a large number of
cell styles, there is extreme slowness caused by the use of ArrayList.indexOf()
which is O(n) time, leading to overall O(n^2) behavior.

This is easily fixed by keeping a reverse mapping from list entry to list
index. Since the list of styles is only appended to, this is easy.

The attached patch reduced the time to generate my XSLX output from several
minutes to under 10 seconds.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|Mac OS X 10.1               |All
           Severity|normal                      |enhancement

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #1 from Javen O'Neal <[hidden email]> ---
Out of curiosity, how many styles are you creating that cause the reported
several minutes/10 seconds times? Are any of the styles identical? I ask
because most workbooks I create use under 100 styles. Unless your workbook is a
styles demo that shows every permutation of font, data format, border,
background and foreground color, it seems difficult to get a style count high
enough in most applications where there wouldn't be duplicates.

I've rolled my own code to manage styles so I can avoid creating millions of
styles. Usually I want to change the data format of an existing cell without
affecting other cells that use the same style. Rather than cloning the cell's
style (that's how you get millions of styles), I temporarily change the cell
style's data format, search the style table if there's another style that
matches, and if so change the cell's style reference to the match, otherwise I
clone the style. Finally, I revert my dataformat change to the original style.
Because I apply style changes to thousands of cells, I have some extra data
structures to make the style lookup process faster than linearly searching the
style table.

I mention this because it may solve your problem if you don't really need 1
million styles.

POI could benefit from a way to consolidate duplicate styles, either when the
workbook is written to disk, or through an explicit call.

Glancing at your patch, it looks like your change adds some data structures.
The consequence is:
1) higher memory consumption
2) extra processing power updating multiple data structures
3) potential for the data structures to get out of sync, especially considering
projects that subclass POI.

My recommendation is use a single data structure that is a container for cell
styles, that combines the features that you need that will give fast by-index
and by-style lookup. If such a data structure isn't available off-the-shelf,
you may want to write your own. Most trivially, this is just a class that
contains an ArrayList and a HashMap for inverted array lookups, and any call to
the hybrid data structure would update both underlying data structures.
Encapsulating the complexity is the key to solving my 3rd concern, and makes
solving #2 and #1 easier to solve down the road.

Alternatively, if you can read the styles from the array into a hashmap, clear
the array, and only maintain a hashmap throughout the life of the style table,
and then recreate the array when you need to save the style table, you've saved
yourself the memory and performance overhead, and also avoided the potential
for out-of-sync data structures so long as you clearly mark the array as
unmaintained (maybe clear it or set it to null, or don't make it an instance
variable, plus comments).

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #2 from Archie Cobbs <[hidden email]> ---
(In reply to Javen O'Neal from comment #1)
> Out of curiosity, how many styles are you creating that cause the reported
> several minutes/10 seconds times?

I have one column where the background color is set according to the
data value as a visual aid. Lots of values lead to lots of colors and
therefore styles. Even limiting to couple thousand colors leads to this
problem.

> 1) higher memory consumption

Yes but insignificantly. A small constant number of bytes per style.

> 2) extra processing power updating multiple data structures

Huh?!? You're missing the whole point.

If you have non-trivial number of styles, there will be MUCH less
processing power utilized because we eliminate the stupid O(n^2) behavior.

For a trivial number of styles, the overhead is minimal - linear time in
the number of styles, which by definition, is small.

> 3) potential for the data structures to get out of sync, especially
> considering projects that subclass POI.

Wrong - the fields are all private. It's not possible for a subclass
to get things out of sync.

> My recommendation is use a single data structure that is a container for
> cell styles, that combines the features that you need that will give fast
> by-index and by-style lookup. If such a data structure isn't available
> off-the-shelf, you may want to write your own. Most trivially, this is just
> a class that contains an ArrayList and a HashMap for inverted array lookups,

That's a possible refinement and I considered it. My goal was to address
the immediate problem.

You are letting the perfect be the enemy of the good. I've provided a
reasonable patch to address what is truly stupid behavior. My recommendation
is to use this patch to "put out the fire" so to speak, and we can refine it
later at a lower priority.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #3 from Nick Burch <[hidden email]> ---
In terms of the style optimisation, we do have a style optimiser for HSSF, but
nothing for XSSF. Some of the logic could be re-used, but a fair bit would need
writing

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #4 from Javen O'Neal <[hidden email]> ---
(In reply to Archie Cobbs from comment #2)
> I have one column where the background color is set according to the
> data value as a visual aid. Lots of values lead to lots of colors and
> therefore styles. Even limiting to couple thousand colors leads to this
> problem.

Okay. Sounds like you really do need that many styles if the style is dependent
on the value. This isn't by any chance related to Excel's Conditional
Formatting Color Scales [1]? This would still probably require as many styles
as there are values, but would be easier to work with.


I'll take a look at your patch in context with the code later this week to see
if we can get away with maintaining just a single data structure.

Thank you for the patch!

[1]
https://support.office.com/en-us/article/Highlight-patterns-and-trends-with-conditional-formatting-eea152f5-2a7d-4c1a-a2da-c5f893adb621

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #5 from Archie Cobbs <[hidden email]> ---
(In reply to Javen O'Neal from comment #4)
> Okay. Sounds like you really do need that many styles if the style is
> dependent on the value. This isn't by any chance related to Excel's
> Conditional Formatting Color Scales [1]?

No - I'm not familiar with that. Sounds like a neater solution though.

Does OOXML and/or the POI API have support for that?

> I'll take a look at your patch in context with the code later this week to
> see if we can get away with maintaining just a single data structure.

Great, thanks!

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #6 from Nick Burch <[hidden email]> ---
(In reply to Archie Cobbs from comment #5)
> Does OOXML and/or the POI API have support for that?

Yes! Take a look at the colourScales method of the ConditionalFormats example
program to get started, then the related unit tests if you need more

(Note that if you write a colour scale into a .xls or .xlsx file, it will only
render properly if you open in a new enough copy of Excel. Older versions will
skip over the records/xml without rendering anything)

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33352|0                           |1
        is obsolete|                            |

--- Comment #7 from Javen O'Neal <[hidden email]> ---
Created attachment 33382
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33382&action=edit
Rebased attachment 33352 to r1721943.

I committed the trivial changes from attachment 33352. I've rebased the patch
to r1721943.

This patch should use a Bidirectional Map or other data structure so that the
StylesTable doesn't become too difficult to read. I've posted a question to the
dev mailing list for feedback.

This test also fails the current unit test suite. Looks to be some issues with
ArrayIndexOutOfBounds.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #8 from Javen O'Neal <[hidden email]> ---
Created attachment 34235
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34235&action=edit
patch showing mixed based indices

Some of the methods return list.size() before adding an item and other methods
return list.size() after adding an item. The effect is whether a 0-based list
index or 1-based list index is returned. Without better javadocs or comments
saying which should be which, I'm inclined to believe that there are some bugs
hiding in the attached code.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #9 from Javen O'Neal <[hidden email]> ---
Created attachment 34239
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34239&action=edit
MappedList data structure that combines a List and a TreeSetValuedHashMap to
improve list reverse lookup speed

I wrote a general-purpose class using commons-collections4.1 (this patch was
generated off the commons collections trunk, not POI). The indexOf and contains
methods were significantly faster, but insertion and removal were pitifully
slow (it's slow to shift all the elements in an array by 1 for an ArrayList
implementation, but this class is even slower because it has to shift the
values of a map--from my trials, it's faster to rebuild the entire map when an
insertion or deletion happens that isn't at the end of the list.

Some sample timing:
Each list was initialized with 10,000 elements and 10,000 operations of each
type were performed in succession. Timing is relative and seems to be
influenced by JIT. Repeatability is around +/-10%, so I have rounded to 2
significant figures. Low numbers are better.
>                        add; toArray; iterator; insert; get; indexOf; contains; remove
>             ArrayList = 2;   1200;       68;    100;    9;    310;      220;      81;
>            LinkedList = 3;   1800;     1500;    350;  420;   1000;     1000;     400;
> NodeCachingLinkedList = 4;   2600;     1900;    250;  350;    860;      830;     370;
>              TreeList = 4;   1300;     3400;     68;   10;   1500;     1500;      55;
>            MappedList = 52;  1600;     1900;  20000;    0;     64;        6;   21000;
> Elapsed time: 1m8s

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #10 from Javen O'Neal <[hidden email]> ---
Created attachment 34241
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34241&action=edit
a non-working MapUniqueList class that mimicks an associative SetUniqueList

If we can remove the scenario where duplicate entries may exist in the lists
then we could write a MapUniqueList (see
org.apache.commons.collections4.list.SetUniqueList), which may be more
performant.

Looking at the usage of these lists, we only use List.add(E), List.indexOf(E),
and List.size(), so we may not run into the performance bottlenecks when
elements in the array are shifted (and this makes sense--we wouldn't want to
change the style index as there are other data structures that only reference
the index, not the object itself.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #11 from Archie Cobbs <[hidden email]> ---
Created attachment 34275
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34275&action=edit
Version of original patch using Guava's BiHashMap class

Just for what it's worth, I'm attaching a patch against 3.14 that uses Guava's
BiHashMap, which is a bidirectional Map.

This makes the logic much simpler.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #12 from Javen O'Neal <[hidden email]> ---
Apache Commons collections4 is available to us, but a BiDiMap does not work
because values (and therefore keys in the inverted map) may not be unique.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

--- Comment #13 from Archie Cobbs <[hidden email]> ---
It looks like this issue may be addressed in poi-3.17. My tests show better
performance.

Can you confirm? Thanks.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58740] [PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58740

ithan <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]