[Bug 61832] New: Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

classic Classic list List threaded Threaded
27 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[Bug 61832] New: Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

            Bug ID: 61832
           Summary: Unable to create a excel of 500,000 rows and 150
                    column using SXSSF workbook
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: SXSSF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35565
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35565&action=edit
"GenerateExcelFile.txt" contains the java code.

I am trying to generate a excel with SXSSF workbook. The size of excel may
shoot to 200-300MB, whose row count will be 500,000 (i.e 0.5 million) and
column count will be around 150 approximately..! I get content issue while
opening such large excel file.
The version which I am using is APACHE POI 3.9
I had also tried with, 3.15 the same issue exist.

System holds enough ram and disk space as per the requirement.

3.9 performance seems to better than other versions, Is there any way to rule
this issue out in that version.?

Anyone, who has idea on above scenario, kindly help me to proceed further.

FYI, I hadn't applied any styles or format for the cells, just inserted a data
of 25 character length. I hadn't violated/crossed any excel
specification/limits.

Please find the sample code as below.

The alert message which I receive on opening the excel file is,

"We found a problem with some content in 'filename'. Do you want us to try to
recover as much as we can? If you trust the source of this workbook"

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #1 from Thamodharan <[hidden email]> ---
Created attachment 35566
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35566&action=edit
Excel 2013 - Alert Message - While opening such large file

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #2 from Yegor Kozlov <[hidden email]> ---
(In reply to Thamodharan from comment #1)
> Created attachment 35566 [details]
> Excel 2013 - Alert Message - While opening such large file

Is it a 32 or 64-bit Excel ? I wonder if it matters.

What happens if you click Yes and agree to recover as much as possible ?

Also, try to disable automatic re-calculation of formulas in Options / Formulas
/ Calculation Options. By default, Excel recalculates all the formulas in a
worksheet on open. You'd better turn it off for such large files.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #3 from Thamodharan <[hidden email]> ---
(In reply to Yegor Kozlov from comment #2)

> (In reply to Thamodharan from comment #1)
> > Created attachment 35566 [details]
> > Excel 2013 - Alert Message - While opening such large file
>
> Is it a 32 or 64-bit Excel ? I wonder if it matters.
>
> What happens if you click Yes and agree to recover as much as possible ?
>
> Also, try to disable automatic re-calculation of formulas in Options /
> Formulas / Calculation Options. By default, Excel recalculates all the
> formulas in a worksheet on open. You'd better turn it off for such large
> files.

Hi Yegor Kozlov,

FYI,
The above attached code works good, if the number of rows has been reduced to
0.3 million rows (i.e 300,000) and 150 column. I can open the file without any
issue or error. But the same doesn't happen for 0.4 or 0.5 million rows.


Response to your questions.,

 1)  Its 32-bit Excel 2013.

 2)  When I click Yes to recover, excel stops responding few minutes and later
i am able to retrieve all data without any loss, but I get repair error report
in an alert box. Please find new attachments related to it.

     - Once, I save the repaired file, I didn't get the content issue pop up
when I open it again or later.

 3)  We just tried to insert string, no numeric or formulas are applied to the
workbook. Though, I tried your suggestion.

     - Before generating the workbook, I turned off the formula in local excel
file and saved it, checked once again by opening, it remains turned off. But
once the code is executed successfully, the formula is again enabled
automatically. Same Content issue persist for large files.

Kindly help if there is any further solution.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #4 from Thamodharan <[hidden email]> ---
Created attachment 35571
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35571&action=edit
ExcelErrorAfterRepair

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #5 from Yegor Kozlov <[hidden email]> ---
Can you please do one more test: generate a 500K rows x 150 columns spreadsheet
containing only numbers and see if it opens? Due to its streaming nature SXSSF
stores strings differently than Excel does: Excel puts strings in a shared
cache while SXXSF stores strings inline. Number cells should be fully
compatible . Lets see.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #6 from Thamodharan <[hidden email]> ---
(In reply to Yegor Kozlov from comment #5)
> Can you please do one more test: generate a 500K rows x 150 columns
> spreadsheet containing only numbers and see if it opens? Due to its
> streaming nature SXSSF stores strings differently than Excel does: Excel
> puts strings in a shared cache while SXXSF stores strings inline. Number
> cells should be fully compatible . Lets see.

Hi Yegor Kozlov,

I had tried as you suggested, it worked fine for 500K X 150 column while
inserting numbers.The maximum int limit which we can give is 2147483647.
cell.setCellValue(2147483647);
Thus it worked fine.

So, I extended the testing by increasing the row count and column count,
i.e., 900K X 250 Column. Same cellValue 2147483647.
The workbook is done successful. File size reached to 504 MB.

But,The same content issue occurs while opening.

Still as a mystery.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #7 from Greg Woolsey <[hidden email]> ---
I ran the attached code, and examined the generated workbook.  All XML
validates successfully with xmllint.  The worksheet file, however, expands to >
5GB due to all the inlineStr attributes for all the cells.

My suspicion is that Excel doesn't like that many inline cell string values.  I
verified that the change between the original and "repaired" versions was
exactly turning those inlineStr attributes into a single shared string value
referenced by all the cells.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #8 from Thamodharan <[hidden email]> ---
Hi Greg Woolsey,

SXSSF by default takes inline string, it can be changed to shared string for
which some patch is available. Since Shared string takes the heap space memory,
its difficult to opt for it, due to resource constraint. Is there any luck, to
get it done with SXSSF inline itself ?

Regards
Thamodharan B

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #9 from Yegor Kozlov <[hidden email]> ---
In theory SXSSF can use a fixed-size cache of shared strings, e.g. we can allow
SXSSF to put 10K distinct strings in the SST. If the cache is full then the
string goes inline. The 10K limit can be configurable.
This way SXSSF-generated workbooks with small number of distinct values will be
fully compatible with XSSF.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #10 from Thamodharan <[hidden email]> ---
Hi Yegor Kozlov,

Hope, you mean to set this limit 10K as mentioned below.

SXSSFWorkbook wb = new SXSSFWorkbook(10000);

Though I set this, I face content issue while opening it.
I tried extending it to 30K, the same persist.

If so my try is wrong from what you suggest. Kindly correct it.

I have limited heap space, but have enough disk space. Though I plan to extend
the memory, and store in SST, I cant run parallel workbook. I can do only one,
though i increase twice or thrice as of now.

My system RAM is 8GB, I had allotted 1 GB max to jboss.

Kindly suggest your view on it.

Regards
Thamodharan B

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #11 from Yegor Kozlov <[hidden email]> ---
try to construct SXSSFWorkbook in a different way:

  wb = new SXSSFWorkbook(
    null, /* template workbook. pass null to start with a blank workbook */
    1000, /* size of the sliding window*/
    true, /* whether to compress temp files.*/
    true  /* whether to use shared string table*/
  );



not that it may increase the memory footprint, especially if the number of
distinct strings is large .

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #12 from Thamodharan <[hidden email]> ---
(In reply to Yegor Kozlov from comment #11)

> try to construct SXSSFWorkbook in a different way:
>
>   wb = new SXSSFWorkbook(
>     null, /* template workbook. pass null to start with a blank workbook */
>     1000, /* size of the sliding window*/
>     true, /* whether to compress temp files.*/
>     true  /* whether to use shared string table*/
>   );
>
>
>
> not that it may increase the memory footprint, especially if the number of
> distinct strings is large .


Hi Yegor Kozlov,

As you stated, it worked fine, if a same string is written in all the cells
(500 Row X 150 Column). Even it is good if same string written in (750 rows X
150 Column).

Thus by digging more about it,I came to know how shared string table concept
works.

But for more number of distinct string that need to be inserted, i face java
heap space out of memory error, since shared string takes all distinct in into
heap memory.

Help me to improve it further.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #13 from Greg Woolsey <[hidden email]> ---
At some point, you just have to allocate more heap, no way around it. Sounds
like you might be there, needing to allow for a very large shared strings
table.  Some tasks just take a large amount of RAM, which is cheap these days,
including the ability to "rent" time and space on any number of cloud provider
platforms to achieve tasks that don't warrant purchasing your own additional
hardware.

Only other thing I can think of would be to write a patch for POI yourself that
adds an option for streaming and shared strings to store the hash table on disk
somehow.  That would save RAM but be incredibly slow.  Better off just giving
the VM 64GB of RAM or something.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #14 from Thamodharan <[hidden email]> ---
Though I adopt 64GB system RAM and 30GB to JBOSS heap space, I cant even
produce two parallel workbook of such large size at a time.

In case of adopting to write a patch, speed is concerned as you said.

Is there any possibility to overcome this issue in forthcoming version?

Anyhow, I will give a try to write patch, can you please guide or provide any
reference to write a patch to store SST on disk as you suggested.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #15 from [hidden email] ---
I do not agree with this at all: "Some tasks just take a large amount of RAM,
which is cheap these days, including the ability to "rent" time and space on
any number of cloud provider platforms [...]".

This is the kind of reasoning that leads to bloatware and all around bad
products and bad libraries. Here is an approach that might work for improving
performance well within the OP requirements. I'd be interested to hear any
reasons this would not work.

1. shared string table uses a lookup that only tracks hash and index
2. worksheets and shared string table are output as streams to temporary files
(optionally compressed prior to write, which often saves time given speed of
CPU and sluggishness of write operations)
3. after all pieces from #2 are created, the final xlsx is assembled

Assuming a 256-bit hash, 1 mm rows, and 150 columns, and 100% non-distinct
string values (obviously unlikely) step 1 will take a maximum of
(256+32)/8*150*1mm = 5.4GB of memory. This could be further streamlined by not
hashing string values of under 32 bytes and/or using a 128-bit hash, and could
be further supplemented by supporting hash collisions. Of course there are
libraries that will take care of this so it shouldn't be much work to
incorporate. Something that is performance-oriented like sqlite3 comes to mind
as a possibility.

Step 2 will take minimal fixed memory because it is streaming the data to disk.

Step 3 can release all memory before starting, and then stream all the pieces,
in serial, into the final output (and implicitly through a zip streamer), which
can be either streamed to the calling process or to disk and thereby take up
minimal fixed memory.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #16 from PJ Fanning <[hidden email]> ---
I have the initial work done on a prototype version of SharedStringTable that
can be used in SXSSF to reduce the memory footprint.
The Pull Request is for discussion as opposed to suggesting this work is
anywhere close to being merged yet.
https://github.com/apache/poi/pull/85 

The general idea is to put the shared string data in an H2 MVTable backed by a
temp file. This is not done yet.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #17 from Thamodharan <[hidden email]> ---
Hi mewalig,

Does 1mm what you mentioned refer to 0.1 million or 1 million rows ?

Can you please explain this calculation [(256+32)/8*150*1mm = 5.4GB of memory]
in detail, so that it would be easy for me to calculate as per my requirement.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #18 from Thamodharan <[hidden email]> ---
Hi mewalig,

Kindly help us by explaining the calculation which you stated on your last
reply.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61832] Unable to create a excel of 500,000 rows and 150 column using SXSSF workbook

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #19 from Thamodharan <[hidden email]> ---
My machine specs,

Jboss server
RAM 32 GB
Heap size allocated to jboss 28 GB
Harddisk 1 TB

I am good to go with generating excel file upto a limit of 0.3 million X 200
rows with enabling shared string. Ahead such row, i face corrupt file issue
when opening the generated excel. The reason for this error as stated by greg
woolsey is, the data's are stored as inline string in the disk, thus excel
couldn't tolerate putting such large number of inline string, thus repairing
it, turning inline to single shared string.

As suggested by greg woolsey, its better to go by enabling shared string, which
holds unique data in heap space instead of disk space. This idea may aid the
application to generate a single excel file in my machine with the above spec,
but generating 2 or more excel at same time, will lead to memory out.

I guess, writing a patch to store data in disk, rather in heap, will again lead
to inline string and same error may occur.

Can anyone from apache dev team or other followers, ignite a vice idea to rule
this error out ?

Waiting a good response

Thamodharan B

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12