[Bug 60316] New: When processing glossary components, we need to return grandparent for getXWPFDocument

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60316

            Bug ID: 60316
           Summary: When processing glossary components, we need to return
                    grandparent for getXWPFDocument
           Product: POI
           Version: 3.16-dev
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: [hidden email]
          Reporter: [hidden email]

On TIKA-2147 and TIKA-2149, Seva Alekseyev and Sharath Kumar shared two
documents that throw:

java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
    at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
    at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
    at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)

I think the issue is that, because the footnotes are within a glossary, when
we call getXWPFDocument() we're invoking .getParent(), which returns a plain
POIXMLDocumentPart rather than the XWPFDocument.  If we get the grandparent in
this case, we actually get the XWPFDocument.

I propose something along these lines:

    public XWPFDocument getXWPFDocument() {
        if (document != null) {
            return document;
        } else {
            Object parent = getParent();
            if (parent != null) {
                if (parent instanceof XWPFDocument) {
                    // usual case: the part hangs directly off the document
                    return (XWPFDocument) parent;
                } else if (parent instanceof POIXMLDocumentPart) {
                    // glossary case: the document is one level further up
                    Object grandParent = ((POIXMLDocumentPart) parent).getParent();
                    if (grandParent instanceof XWPFDocument) {
                        return (XWPFDocument) grandParent;
                    }
                }
            }
            throw new IllegalStateException("couldn't find the parent");
        }
    }
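
To make the failure mode concrete outside of POI, here is a minimal,
self-contained sketch.  FakePart and FakeDocument are hypothetical stand-ins,
not the real POIXMLDocumentPart and XWPFDocument, and findDocument() is an
assumed helper that walks the whole parent chain instead of stopping at the
grandparent as the proposal above does.  It only illustrates why the blind
cast of the direct parent fails for a part nested under a glossary.

```java
// Hypothetical stand-in for a generic document part (NOT the POI class).
class FakePart {
    private final FakePart parent;

    FakePart(FakePart parent) {
        this.parent = parent;
    }

    FakePart getParent() {
        return parent;
    }
}

// Hypothetical stand-in for the top-level document (NOT the POI class).
class FakeDocument extends FakePart {
    FakeDocument() {
        super(null);
    }
}

class AncestorLookup {
    /** Returns the nearest FakeDocument ancestor of start, or null if there is none. */
    static FakeDocument findDocument(FakePart start) {
        for (FakePart p = start.getParent(); p != null; p = p.getParent()) {
            if (p instanceof FakeDocument) {
                return (FakeDocument) p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FakeDocument doc = new FakeDocument();
        FakePart glossary = new FakePart(doc);        // nested document part
        FakePart footnotes = new FakePart(glossary);  // footnotes inside the glossary

        try {
            // Mirrors the failing code path: cast the direct parent blindly.
            FakeDocument wrong = (FakeDocument) footnotes.getParent();
            System.out.println("unexpected: cast succeeded for " + wrong);
        } catch (ClassCastException e) {
            System.out.println("blind cast of the direct parent failed");
        }

        System.out.println("ancestor walk found the document: "
                + (findDocument(footnotes) == doc));
    }
}
```

Returning null instead of throwing is a choice made only for this sketch; the
proposal above throws an IllegalStateException when no document is found.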

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

--- Comment #1 from Tim Allison <[hidden email]> ---
On further review, and given TIKA-2163, it looks like this is a whole new
kettle of worms.  The proposed fix is incorrect duct tape over a far larger
issue.

We aren't currently handling the glossaryDocument as a special relationship
type.  Anyone have experience with glossaryDocument?  Looks like an entire
other document stored within the document...

--- Comment #2 from Tim Allison <[hidden email]> ---
Does anyone have a recommendation for a more graceful outcome than a
ClassCastException for files with a GlossaryDocument?

I suspect the actual fix will take a nontrivial amount of work. I don’t want to
hide/forget the issue, but I also would prefer a different outcome...logging
perhaps?

This issue was recently raised on
https://issues.apache.org/jira/browse/TIKA-2769 via an elasticsearch issue. Our
current workaround on Tika is to recommend the SAX based docx parser.

--- Comment #3 from Dominik Stadler <[hidden email]> ---
I would opt for handling this more gracefully.  Even when POI does not support
a feature, it would be nice if it could still handle the document to some
degree, so a log message would probably be more appropriate for now.

Tim Allison <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|When processing glossary    |Handle Glossary in
                   |components, we need to      |XWPFDocument
                   |return grandparent for      |
                   |getXWPFDocument             |

--- Comment #4 from Tim Allison <[hidden email]> ---
Thank you, Dominik.

Unless there are objections, I'll try to add logging as a first step.  I'll
leave this ticket open for when someone has time to add the new capability.

--- Comment #5 from Tim Allison <[hidden email]> ---
In r1845517, I added a check+log+skip to avoid a ClassCastException until we
have time to implement correct handling of a glossary document.
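
For readers unfamiliar with the pattern, a check+log+skip can be sketched
roughly as below.  This is not the code committed in r1845517; the class name,
the partsToLoad() helper, and the short relation labels ("glossaryDocument",
"header", "footnotes") are assumptions made up for illustration, and
java.util.logging stands in for whatever logger the project uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

class SkipGlossarySketch {
    private static final Logger LOG =
            Logger.getLogger(SkipGlossarySketch.class.getName());

    // Hypothetical label for the unsupported relation type.
    static final String GLOSSARY = "glossaryDocument";

    /** Returns the relation types that should be loaded, skipping the glossary. */
    static List<String> partsToLoad(List<String> relationTypes) {
        List<String> kept = new ArrayList<>();
        for (String type : relationTypes) {
            if (GLOSSARY.equals(type)) {
                // check + log + skip: note the unsupported part and move on
                // instead of loading it and failing later on a cast.
                LOG.warning("Skipping unsupported part: " + type);
                continue;
            }
            kept.add(type);
        }
        return kept;
    }
}
```

The point of the sketch is only that the unsupported part is filtered out
before any child parts are initialized, so the rest of the document can still
be read.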

--- Comment #6 from Tim Allison <[hidden email]> ---
I shouldn't have skipped "template" types.  I should have skipped "glossary"
types.  This leads to a regression where headers/footers are not extracted from
template documents.

Will commit fix and new unit test once local build/test/test-integration
completes successfully.

--- Comment #7 from Tim Allison <[hidden email]> ---
Fixed in r1847263
