[Bug 61251] New: Out of memory when opening the DOCX file

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61251

            Bug ID: 61251
           Summary: Out of memory when opening the DOCX file
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35097
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35097&action=edit
XWPF OOM bug

Hi guys

I get an OutOfMemoryError (OOM) when opening one particular DOCX file.

POI versions I tried:
3.15
3.16
3.17-beta1

The code is simple:

        InputStream in = new FileInputStream(new File(path));
        XWPFDocument document = new XWPFDocument(in);

Exception details:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
        at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeData(Unknown Source)
        at org.apache.xerces.dom.ElementNSImpl.getNamespaceURI(Unknown Source)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
        at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
        at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
        at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:152)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Tim Allison <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #1 from Tim Allison <[hidden email]> ---
Thank you for opening this issue.

Do you want to modify the document, or are you only interested in text/metadata
extraction?  If extraction only, I added a SAX parser in Apache Tika, which is
far more efficient than our DOM parser.

--- Comment #2 from PJ Fanning <[hidden email]> ---
I think POI uses more memory if you use the InputStream constructors.
Could you try creating an OPCPackage based on the File?
https://poi.apache.org/apidocs/org/apache/poi/openxml4j/opc/OPCPackage.html
And then create the XWPFDocument based on the OPCPackage?

--- Comment #3 from Tim Allison <[hidden email]> ---
Yes, agreed, PJ.  I'm not having any trouble parsing this with Tika and our usual
DOM parser even with -Xmx128m, and we use the OPCPackage from the file.  I can
reproduce the OOM with -Xmx64m.
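
For anyone trying to reproduce this at a given heap size, a minimal sketch for checking which limit the JVM actually picked up from -Xmx. The class name HeapLimit is hypothetical and not part of POI or Tika; it relies only on Runtime.maxMemory() from the JDK.

```java
public class HeapLimit {

    /** Maximum heap the JVM will attempt to use, in MiB (as reported by the runtime). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Print the effective limit so it can be compared with the -Xmx flag in use.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Started as `java -Xmx64m HeapLimit`, this should report a value at or slightly below 64 MiB; the exact figure depends on the JVM and garbage collector.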

--- Comment #4 from [hidden email] ---
Still getting OOM with

final OPCPackage in = OPCPackage.open(new File(path));
XWPFDocument document = new XWPFDocument(in);

Am I doing it right?

--- Comment #5 from PJ Fanning <[hidden email]> ---
@bsevryukov your code is correct.
As Tim points out, it seems that you need to increase your -Xmx setting.
The OPCPackage approach uses less memory, but XWPFDocument does not stream the
document: it is loaded into memory as a whole, so the larger the docx, the more
memory XWPFDocument needs.
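
To illustrate the difference between streaming and building a full in-memory model, here is a rough, JDK-only sketch that pulls plain text out of a docx with StAX. It is not how POI or Tika implement extraction. The class name DocxTextStreamer is hypothetical, and the sketch assumes the main part is stored at word/document.xml and uses the transitional WordprocessingML namespace; it ignores tables, headers, footnotes and everything else XWPFDocument models.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DocxTextStreamer {

    // Assumption: transitional WordprocessingML namespace (strict OOXML uses another one).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Opens the docx as a zip and streams the part assumed to be the main document. */
    public static String extractText(File docx) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractText(in);
            }
        }
    }

    /** Reads the XML event by event; only the collected text is kept in memory. */
    public static String extractText(InputStream documentXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities, since the input may be untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(documentXml);
        StringBuilder text = new StringBuilder();
        boolean inText = false;
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (isWordElement(reader, "t")) {
                            inText = true;          // entering a w:t text run
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (inText) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (isWordElement(reader, "t")) {
                            inText = false;
                        } else if (isWordElement(reader, "p")) {
                            text.append('\n');      // one line per w:p paragraph
                        }
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    private static boolean isWordElement(XMLStreamReader reader, String localName) {
        return W_NS.equals(reader.getNamespaceURI())
                && localName.equals(reader.getLocalName());
    }
}
```

A call such as `DocxTextStreamer.extractText(new File(path))` returns the paragraph text only. This is useful when extraction is all that is needed; modifying the document still requires a full model such as XWPFDocument and therefore a larger heap.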

--- Comment #6 from [hidden email] ---
Thanks, PJ. Increasing the Xmx size helped.

Thank you guys for the fast response.

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WORKSFORME
             Status|NEW                         |RESOLVED

--- Comment #7 from Dominik Stadler <[hidden email]> ---
Resolved based on the latest comment.
