[Bug 58525] New: Optimization of adding/modifying named cells

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug 58525] New: Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

            Bug ID: 58525
           Summary: Optimization of adding/modifying named cells
           Product: POI
           Version: 3.13-dev
          Hardware: PC
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: HSSF
          Assignee: [hidden email]
          Reporter: [hidden email]

Created attachment 33196
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=33196&action=edit
patch file created with Git.

If a large number of named cells are used, a great deal of time is spent
determining if the name is already used in a given sheet due to looking up the
name by looping through an array of names.

This optimization adds a Map<Integer sheetNumber, Map<String name, NameRecord>>
wherever the array of names exists. The NameRecord may be an HSSFName depending
on which model is being used.

The duplicate test in HSSFName.setNameName() uses the map to find a duplicate
if one exists rather than looping through the names array.

As far as I know this does not break any existing functionality. All unit tests
executed in the ant build are successful although none of the unit tests have
been modified to use the new getName(name, sheetNumber) method.

This change was prompted by a project that builds a large excel workbook of 58
sheets with about 3000 named cells per sheet. The time required to produce the
workbook was 4:47 before the modification and 3:14 after, about a 30%
improvement.

The change is based on poi-3.13. To my knowledge this is a release version.

My local project is named poi-3.13-pwc. The included patch references that
project name.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |PatchAvailable
            Summary|Optimization of             |[PATCH] Optimization of
                   |adding/modifying named      |adding/modifying named
                   |cells                       |cells
                 OS|                            |All

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

--- Comment #1 from Javen ONeal <[hidden email]> ---
It looks like you're maintaining two data structures that contain essentially
the same content, a List and a HashMap. In the interest of keeping POI's memory
footprint down, would it be possible to use just one data structure? It might
even be faster, since write operations wouldn't need to modify 2 data
structures.

> public LinkTable(int numberOfSheets, WorkbookRecordList workbookRecordList) {
>   _workbookRecordList = workbookRecordList;
>   _definedNames = new ArrayList<NameRecord>();
>   _nameRecordMap = new HashMap<Integer, Map<String, NameRecord>>();
>   ...
>   while(true) {
>     if (nextClass == NameRecord.class) {
>       NameRecord nr = (NameRecord)rs.getNext();
>       _definedNames.add(nr);
>       addToNamesMap(nr);
>     }
>   }

> public void removeBuiltinRecord(byte name, int sheetIndex) {
>  NameRecord record = getSpecificBuiltinRecord(name, sheetIndex);
>   if (record != null) {
>    _definedNames.remove(record);
>    removeFromNameRecordMap(record);
>   }
> }

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

--- Comment #2 from Richard Hart <[hidden email]> ---
(In reply to Javen ONeal from comment #1)

> It looks like you're maintaining two data structures that contain
> essentially the same content, a List and a HashMap. In the interest of
> keeping POI's memory footprint down, would it be possible to use just one
> data structure? It might even be faster, since write operations wouldn't
> need to modify 2 data structures.
>
> > public LinkTable(int numberOfSheets, WorkbookRecordList workbookRecordList) {
> >   _workbookRecordList = workbookRecordList;
> >   _definedNames = new ArrayList<NameRecord>();
> >   _nameRecordMap = new HashMap<Integer, Map<String, NameRecord>>();
> >   ...
> >   while(true) {
> >     if (nextClass == NameRecord.class) {
> >       NameRecord nr = (NameRecord)rs.getNext();
> >       _definedNames.add(nr);
> >       addToNamesMap(nr);
> >     }
> >   }
>
> > public void removeBuiltinRecord(byte name, int sheetIndex) {
> >  NameRecord record = getSpecificBuiltinRecord(name, sheetIndex);
> >   if (record != null) {
> >    _definedNames.remove(record);
> >    removeFromNameRecordMap(record);
> >   }
> > }

Indeed. I first tried to replace the names array with a map but as I got
deeper, more and more would have to be changed. The issue is so many methods
use the getName(index) method which is not only fast but cannot be done on a
map. I did not want to break any existing functionality so I added the map.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

--- Comment #3 from Richard Hart <[hidden email]> ---
If I am not mistaken, the names array also contains names that are empty ("")
such as when the cell is first created. Giving the cell a name is not required.
The map can have only one empty name. So there is a difference between the two
stores. The map contains only names whose length > 0.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

--- Comment #4 from Javen ONeal <[hidden email]> ---
There might be some off-the-shelf solutions available for this, though the
underlying implementation of MultiKeyMap would determine if this saves any
memory over maintaining a List and a Map in parallel--at least a hybrid data
structure would relieve the developers of touching twice as many data
structures and twice as many functions.
https://commons.apache.org/proper/commons-collections/javadocs/api-3.2.1/org/apache/commons/collections/map/MultiKeyMap.html

Alternatively, if the performance hit is small enough, some type of
SortedMap/TreeMap/LinkedHashMap might work here:
getName(int index) -> map.values().toArray()[index]

If those suggestions don't work, you could roll your own hybrid data structure
to maintain two underlying data structures in parallel. Better encapsulation.

> (In reply to Richard Hart from comment #3)
> If I am not mistaken, the names array also contains names that are empty
> ("") such as when the cell is first created. Giving the cell a name is not
> required. The map can have only one empty name. So there is a difference
> between the two stores. The map contains only names whose length > 0.

Hmm.. another wrinkle in the design. Let me think about this problem a bit
more. It's starting to sound like a custom-built hybrid data structure will be
needed here. Maybe that custom data structure is LinkTable.java.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 58525] [PATCH] Optimization of adding/modifying named cells

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=58525

ithan <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]