[Bug 61267] New: Meta data of attached word file gets parsed. However, content of file is not parsed and is blank

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

            Bug ID: 61267
           Summary: Meta data of attached word file gets parsed. However,
                    content of file is not parsed and is blank
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POI Overall
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35106
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35106&action=edit
Meta data of attached word file gets parsed. However, content of file is not
parsed and is blank

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
           Severity|normal                      |major


Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Javen O'Neal <[hidden email]> ---
The file begins with the following bytes:
> 00000000  db a5 2d 00 31 40 09 04  00 00 00 00 2d 00 00 00  |..-.1@......-...|

It also has quite a bit of ASCII embedded in it. This doesn't look like an OLE2
Microsoft Word .doc file or an OOXML Word .docx file. It looks more like a
Microsoft Write .wri file, though it has a different magic number.

> 00000180  09 4d 65 6d 62 65 72 20  6f 66 20 33 47 50 50 20  |.Member of 3GPP |
> 00000190  28 41 52 49 42 29 0d 0a  4d 72 2e 20 42 65 6e 6e  |(ARIB)..Mr. Benn|

Furthermore, I cannot open this file with Google Docs.

Are you sure this is a Microsoft Word file?
I wasn't able to find any common uses of this magic number.

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

--- Comment #2 from Javen O'Neal <[hidden email]> ---
Never mind. It looks like this claims to be a Word 2.0 file.

http://www.filesignatures.net/index.php?page=search&search=DBA52D00&mode=SIG
> DB A5 2D 00   Word 2.0 file, ASCII

[Bug 61267] Extract text from Microsoft Word 2.0 (pre-OLE2) document

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Meta data of attached word  |Extract text from Microsoft
                   |file gets parsed. However,  |Word 2.0 (pre-OLE2)
                   |content of file is not      |document
                   |parsed and is blank         |
           Severity|major                       |enhancement

--- Comment #3 from Javen O'Neal <[hidden email]> ---
There are several entry points into POI. We should figure out what class should
be responsible for checking the first few bytes (magic number) of a file to
figure out what file format it is (Tika style).

We could continue adding known magic numbers to o.a.p.poifs.HeaderBlock, but we
may want to reuse that code elsewhere, such as
WorkbookFactory/DocumentFactory/SlideshowFactory, the Extractor classes for
Tika, etc.
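
A minimal sketch of what such a shared magic-number check might look like,
written as a standalone class rather than against POI's actual API. The class
name MagicSniffer, the Format enum and the detect method are hypothetical names
used only for illustration. The Word 2.0 signature is the one quoted in
comment #2; the OLE2 and ZIP signatures are the well-known ones.

```java
// Hypothetical helper, not part of POI: classifies a file by its leading bytes.
final class MagicSniffer {

    enum Format { OLE2, ZIP_OOXML, WORD2, UNKNOWN }

    // OLE2 compound document header (D0 CF 11 E0 A1 B1 1A E1).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header ("PK\003\004"), the container used by OOXML.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Word 2.0 signature as reported in comment #2 (DB A5 2D 00).
    private static final byte[] WORD2_MAGIC = {
        (byte) 0xDB, (byte) 0xA5, 0x2D, 0x00
    };

    private MagicSniffer() { }

    // Returns the format whose signature matches the start of the header,
    // or UNKNOWN if none matches or the header is too short.
    static Format detect(byte[] header) {
        if (header == null) {
            return Format.UNKNOWN;
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP_OOXML;
        }
        if (startsWith(header, WORD2_MAGIC)) {
            return Format.WORD2;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First eight bytes of the attachment, taken from the hex dump in comment #1.
        byte[] header = {
            (byte) 0xDB, (byte) 0xA5, 0x2D, 0x00, 0x31, 0x40, 0x09, 0x04
        };
        System.out.println(detect(header)); // prints WORD2
    }
}
```

Keeping the check in one place like this would let each entry point (the
factories, the extractors) ask the same question and report an unsupported
format consistently, instead of each one failing in its own way.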

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61267

--- Comment #4 from Nick Burch <[hidden email]> ---
Tip for next time: run the Tika App jar in --detect mode to see if the file
magic is known. In this case, Tika knows it's application/msword2.

Pre-OLE2 Word 2 has 2 magics and Word 5 has 1 (at least that Tika knows
about). Do people think it's worth adding helpful exceptions in POI for those
too?
