[Bug 61371] New: 20^20 hidden cells in a .xlsx file causes heap memory full exception

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] New: 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

            Bug ID: 61371
           Summary: 20^20 hidden cells in a .xlsx file causes heap memory
                    full exception
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Mac OS X 10.1
            Status: NEW
          Severity: major
          Priority: P2
         Component: XSSF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35196
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35196&action=edit
File which causes heap memory out

An .xlsx file has only one cell filled but all other cells are hidden. When
tried to read the file using " workbook = new
org.apache.poi.xssf.usermodel.XSSFWorkbook(in);" this throws an Heap memory
full exception.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Nick Burch <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Nick Burch <[hidden email]> ---
How big a heap did you give to Apache POI? Most JVM default heaps are way way
too small these days

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #2 from Lakshminathan <[hidden email]> ---
The file size as you can see from the attachment is just 9Kb . It has only one
cell with data. But since there are 2^20 hidden cells Apache POI allocates more
heap space causing heap space full exception. I think this can be handled only
by Apache POI because file size is the only information we get even before
reading the file.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #3 from PJ Fanning <[hidden email]> ---
The XSSF model loads all the data in the xlsx file, even if it is hidden, The
XSSF cell interface will let you know which cells are hidden and which are not.
In theory, the XSSF workbook code where we load the xlsx data could have an
option to ignore hidden cells or hidden sheets. This would be an enhancement
and I'm not sure how useful this feature is.
You could also try the XSSFReader which let's you read the xlsx file in a
streaming way.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #4 from Greg Woolsey <[hidden email]> ---
To clarify:

1. even empty hidden cells are still cells, and as such require object
allocations when read using the XSSFWorkbook API.  no space is allocated if
they are truly missing from the workbook - not even defined in the OOXML file.

2. OOXML files are ZIP packages containing mostly XML files.  Cells defined but
empty compress extremely well, so a 9k file can actually unzip to hundreds of
megabytes.  Unzip your sample file and see just how big it really is.  This
will give you a good rough start for estimating the heap space required to read
it using the XSSFWorkbook API.  I'd say double the unzipped size as a good
first guess.

If the result turns out to be too big for the available heap, try using the
streaming API as mentioned by PJ Fanning in comment 3.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Javen O'Neal <[hidden email]> ---
The attached workbook, attachment 35196, is actually quite small uncompressed
and doesn't describe many cells.

xl/worksheets/sheet1.xml:
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1" x14ac:dyDescent="0.25">
    <c r="A1" t="s">
      <v>0</v>
    </c>
  </row>
</sheetData>

sheet2 and 3 are even smaller.

The entire file expands to 28 KB on disk, and I didn't see anything suspicious
in sharedStrings or elsewhere that would cause this workbook to misbehave.

Lakshminathan, please verify whether you're able to read
https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.xlsx
in your application.

Which JVM are you using and what is your max heap size?
What version of POI are you using?
Is there anything else in your application that could be consuming a
substantial part of the heap?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from Lakshminathan <[hidden email]> ---
@Team, Thank you for the response. Please find below the requested information

https://docs.google.com/document/d/1Lf1FEeQDillIrT4fTz-u9nKiUp7s9_60RwTrEApZ4jk/edit?usp=sharing

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #7 from Javen O'Neal <[hidden email]> ---
(In reply to Javen O'Neal from comment #5)
> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application.
>
> Which JVM are you using and what is your max heap size?
> What version of POI are you using?
> Is there anything else in your application that could be consuming a
> substantial part of the heap?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #8 from Javen O'Neal <[hidden email]> ---
Please add supporting information as a comment or an attachment. External
dependencies may break or be inaccessible to some users.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

--- Comment #9 from Lakshminathan <[hidden email]> ---
Created attachment 35199
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35199&action=edit
Information about the bug

> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application. - yes,its working
>
> Which JVM are you using and what is your max heap size? details attached
> What version of POI are you using?3.16
> Is there anything else in your application that could be consuming a
> substantial part of the heap? No

Further details attached.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Lakshminathan <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61371

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #10 from Javen O'Neal <[hidden email]> ---
attachment 35199 references a different file that was uploaded to Google Sheets
which is 2.75 MB instead of 8 KB in size.
Unzipping the 2.75 MB file expands to 27.9 MB (90% compression ratio is typical
for workbooks that do not contain embedded objects).
xl/worksheets/sheet1.xml is 27.9 MB (leaving a few KB for the rest of the
extracted files).

Inspecting the XML, I see that the workbook defines 1 cell at A1 and 1 million
rows:
<sheetFormatPr defaultColWidth="0" defaultRowHeight="15" zeroHeight="1"/>
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1">
    <c r="A1" t="s"><v>0</v></c>
  </row>
  <row r="2" spans="1:1" hidden="1"/>
  <row r="3" spans="1:1" hidden="1"/>
  ...
  <row r="16" spans="1:1" hidden="1"/>
  <row r="17" hidden="1"/>
  ...
  <row r="1048556" hidden="1"/>
  <row r="1048557" hidden="1"/>
</sheetData>

POI uses a TreeMap<int rowNumber, XSSFRow row> to store these rows for fast
random and sequential access, at the cost of some memory. Oracle Java 8 docs
state that it implements TreeMap using a self-balancing pointer-based red-black
tree. This should be fine.

To test that your JVM can handle 1 million items in a TreeMap, do something
like the following:
private static final Random rand = new Random();

private Object createFakeRow(int nbytes) {
    byte[] row = new byte[nbytes];
    rand.nextBytes(row);
    return row;
}

SortedMap<Integer, Object> rows = new TreeMap<>();
for (int r=1; r<=1048557; r++) {
    // create an object that consumes 10 KB of RAM in place of a real XSSFRow
    rows.put(i, createFakeRow(10*1024));
}

If that works without issue, then the next focus is on what XMLBeans is doing
as it's reading Sheet1.xml.
You could look at how POI unzips the file into memory and reads each XML file
into an XML DOM using XMLBeans and how much extra memory is consumed by the CT
classes.
Before we blame XMLBeans, we'd have to fairly compare it with JAXB.

Here's a Google Drive link to this file since the file exceeds the file size
limit of bugzilla:
https://drive.google.com/file/d/0B2v9cndcBwIWeURpNXVrSjJYbDg/view

Based on your reported 9 GB of RAM, the amortized size of each XSSFRow would be
roughly 10 KB.

To make sure this really isn't a bug in POI's handling of hidden rows or
columns, we'd need to test for OOM on a workbook with the same number of rows
defined but all of them visible.

I think you have enough ideas here to try to figure out where the problem is,
which is needed before a potential fix can be written.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Loading...