[Bug 61266] New: File not parsing

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] New: File not parsing

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

            Bug ID: 61266
           Summary: File not parsing
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: critical
          Priority: P2
         Component: POI Overall
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35105
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35105&action=edit
DOC file

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.microsoft.OfficeParser@547eb45
        at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:357)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:308)
        at
org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
        at
org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
        at javax.swing.TransferHandler.importData(Unknown Source)
        at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
        at java.awt.dnd.DropTarget.drop(Unknown Source)
        at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown
Source)
        at
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown
Source)
        at
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown
Source)
        at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
        at java.awt.Component.dispatchEventImpl(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown
Source)
        at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Window.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
        at java.awt.EventQueue.access$500(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at
java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
Source)
        at
java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at
java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header
signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file
appears not to be a valid OLE2 document
        at
org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:181)
        at
org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
        at
org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
        at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
        at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 43 more

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
                 CC|                            |[hidden email]

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Javen O'Neal <[hidden email]> ---
Same comment as bug 61265 and bug 61257, please provide a better bug title and
include the version of POI that you're using.

You can remove the javax.swing, java.awt, and sun calls in the stack trace.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
          Component|POI Overall                 |POIFS

--- Comment #2 from Javen O'Neal <[hidden email]> ---
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header
signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file
appears not to be a valid OLE2 document

Google Docs reported your file as corrupt as well. Are you sure this is a valid
doc file and not encrypted?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #3 from [hidden email] ---
This is a valid DOC file. This is an old file (year 1991). When we open this
file (with text encoding as windows default) and resave it in docx format.
Then, the docx format gets parsed successfully.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |major

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

--- Comment #4 from Javen O'Neal <[hidden email]> ---
If this is a 1991 Word file, then perhaps HWPFOldDocument (for Word 6 and Word
95) should be used instead of HWPFDocument (BIFF8). It's possible that this
file format predates Word 6.

Not sure if POI or Tika should be specifying a different file handler, though
it's possible POI (and therefore Tika) can't read this ancient format.

The o.a.p.poifs.storage.HeaderBlock constructor recognizes that this file is
not a BIFF2, 3, or 4 document.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] File not parsing

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |enhancement

--- Comment #5 from Javen O'Neal <[hidden email]> ---
Looks like POI doesn't currently support reading this file format.

Opening the binary file in a text editor reveals that most of the document
contents are saved as ASCII, with a few special characters to embed figures and
designate the start of sections. This doesn't look like any OLE2 file I have
seen before.

Presumably if all that is needed is text extraction, you could use `strings` on
this document.

Changing this to an enhancement request in case someone is interested in
figuring out what archaic file format this is and writing a primitive parser
that can extract text from the document.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] Extract text from old Word file from 1991

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|File not parsing            |Extract text from old Word
                   |                            |file from 1991

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] Extract text from Microsoft Write document

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Extract text from old Word  |Extract text from Microsoft
                   |file from 1991              |Write document

--- Comment #6 from Javen O'Neal <[hidden email]> ---
Starting with 0x31be, the provided file is presumably a Microsoft Write file,
typically found with a .wri extension, though later saved with a .doc extension
and *optionally* saved in an OLE2 container (this file isn't). This format
dates back to the Windows 1.0 days (1985).

http://www.filesignatures.net/index.php?page=search&search=31BE&mode=SIG
https://en.wikipedia.org/wiki/Microsoft_Write

Strictly speaking, Write is not part of the Microsoft Office suite.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[Bug 61266] Extract text from Microsoft Write document

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61266

--- Comment #7 from Nick Burch <[hidden email]> ---
I've added a more helpful exception for these files in r1801376, based on the
mime magic from Apache Tika for them

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Loading...