[Bug 60707] New: [PATCH] Reading very large excel files using StAX made easier.

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[Bug 60707] New: [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

            Bug ID: 60707
           Summary: [PATCH] Reading very large excel files using StAX made
                    easier.
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SXSSF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 34731
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34731&action=edit
Patch that contains the new classes anf files introduced for enabling this
functionality.

This API helps the user to 'read' very large excel files and return the data in
String format. It represents the excel file as an instance of
‘org.apache.poi.xssf.streaming.reader.StreamedWorkbook’. This workbook can be
sub divided into StreamedSheets> StreamedRows > StreamedCells. StreamedCells
are the basic building block. StreamedCell represent the excel cell and holds
the String representation of the excel cell value.  In order to reduce the
memory usage StreamedCell is restricted to store only the String value & Cell
Number. Since the string value is exactly same as seen in excel file, user can
format it to whatever type he requires.

Apart from that, it uses the pull parser(StAX api) for streaming, so that the
user has more control over the parsing. User can read N rows of data, process
it and then read the next N blocks as so on..

Patch attached. Please take a look and let me know your comments.

thanks,
Renjith

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |PatchAvailable
             Status|NEW                         |NEEDINFO

--- Comment #1 from Javen O'Neal <[hidden email]> ---
This looks like a good start and a complement to SXSSFWorbook, which is a
write-only streaming API.

The unit test for StreamingWorkbook is a nice touch, too.

In order to better integrate these classes within POI, they would need to
implement the Workbook, Sheet, Row, and Cell interfaces so that someone can
write generic Common SS code and switch between HSSF, XSSF, write-SXSSF and
sax-read-SXSSF.

It's fine to stub out most methods for now as either "will be supported in the
future" or "won't be supported due to memory footprint".

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #2 from Renjith R <[hidden email]> ---
Created attachment 34757
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34757&action=edit
Patch which implements the interfaces Cell, Row, Sheet & Workbook

I have implemented the Cell, Row, Sheet & Workbook interfaces for easy
integration with existing functionality.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

Renjith R <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #3 from Renjith R <[hidden email]> ---
I have implemented Cell, Row, Sheet & Workbook interfaces for easy integration
with existing functionality. Methods which needs more research to implement are
mentioned as 'Will be supported in future'. Others are also provided with
proper comments.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #4 from Renjith R <[hidden email]> ---
Gentle reminder. Did someone got a chance to look into?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #5 from Renjith R <[hidden email]> ---
I'll continue to add the functionalities and post the patches here. I will be
glad if someone got a chance to review my work. Let me know if you want me to
look at anything specific or please guide me if I am moving in a wrong
direction.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #6 from Dominik Stadler <[hidden email]> ---
I did a look, here some initial comments:
* The initial implementation looks already like a good start, thanks for
putting in the work
* Maybe we should throw "UnsupportedException" in the methods that are not
supported, this way the user immediately knows even if she does not look at the
JavaDoc?
* As it is complex new functionality, we might first add it to the "scratchpad"
project/source-folders so we can let it stabilise some more until we declare it
as "production ready" by moving it into the ooxml-part.
* Please remove the "// TODO Auto-generated method stub" comments with a
comment that explains why the method is empty or with an exception as stated
above or simply remove it
* Please try to format the code consistently according to our
coding-guidelines, see http://poi.apache.org/guidelines.html#CodeStyle, your
IDE will usually allow to define it and reformat a whole file in one go.
* Is there a way to not duplicate the date-formats in
StreamedSheetEventHandler? We already handle them in various places, e.g.
DateFormatConverter
* On testing I would love to see some test that kind of "trashes" the
implementation, i.e. take all spreadsheets that we have under test-data and run
the normal XSSFWorkbook/HSSFWorkbook and compare results to your implementation
as far as possible. This way we ensure that your code handles all the special
cases that can arise.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #7 from Renjith R <[hidden email]> ---
Thanks a lot for the Inputs, Dominik. Let me work out on these points and I'll
get back to you.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #8 from Renjith R <[hidden email]> ---
Created attachment 34943
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34943&action=edit
Version 3: Patch with some of the review comments implemented

Hi..
I have tried to implement most of the comments. I am still working on reusing
the existing methods for reading the date from excel. Since it is
time-consuming, I am posting rest of the changes here. Kindly take and look and
let me know your queries, suggestions etc..

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #9 from Javen O'Neal <[hidden email]> ---
Could you also add a @NotImplemented annotation to methods where you throw
Unsupported operation exception? This will make it clear to the Javadocs
readers that the method isn't implemented yet.

https://poi.apache.org/apidocs/org/apache/poi/util/NotImplemented.html

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #10 from Renjith R <[hidden email]> ---
Created attachment 34945
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34945&action=edit
Version 4: Added @NotImplemented. Added feature getColumnIndex

Thanks a lot for the suggestion.
Added @NotImplemented.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #11 from Renjith R <[hidden email]> ---
I was analyzing 'DateFormatConverter' to see if I can reuse any logic from it.
But, as per my understanding, this class is more helpful while writing data to
excel. eg convert(Locale.JAPANESE, "dd MMMM, yyyy"), will generate a data
format string which can be used for applying a style to Cell. I'll keep on
searching. Please let me know if anyone has any clue on this.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #12 from Renjith R <[hidden email]> ---
I was analyzing 'DateFormatConverter' to see if I can reuse any logic from it.
But, as per my understanding, this class is more helpful while writing data to
excel. eg convert(Locale.JAPANESE, "dd MMMM, yyyy"), will generate a data
format string which can be used for applying a style to Cell. I'll keep on
searching. Please let me know if anyone has any clue on this.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #13 from Renjith R <[hidden email]> ---
Did someone get a chance to review the code?
I'd be glad to see some comments.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #14 from PJ Fanning <[hidden email]> ---
I added a StaxHelper class today because I think it is a good idea for us to
apply default configuration to the factories. One benefit is to protect against
XML Entity Expansion attacks.
Would you be able to uptake this?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #15 from Renjith R <[hidden email]> ---
Sure, I'll look into it.
Hope (https://bz.apache.org/bugzilla/show_bug.cgi?id=61213) is the one you are
talking about.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #16 from PJ Fanning <[hidden email]> ---
Renjith - the StaxHelper is already merged to the svn trunk (src/java). The
rest of the change for using StAX parser for issue 61213 is still under
discussion.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |61213


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61213
[Bug 61213] Replace SXSSFWorkbook copyStreamAndInjectWorksheet with StAX
equivalent
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 60707] [PATCH] Reading very large excel files using StAX made easier.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60707

--- Comment #17 from PJ Fanning <[hidden email]> ---
Renjith - could you attach SpreadSheetSample04022017.xlsx to the issue?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12