[Bug 60685] New: Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

            Bug ID: 60685
           Summary: Unable to parse .pub files
                    -java.lang.ArrayIndexOutOfBoundsException: 88
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POI Overall
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 34710
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34710&action=edit
test document

When I try to parse the attached .pub file, it fails with the exception below:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 28 more

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All


[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

--- Comment #1 from Javen O'Neal <[hidden email]> ---
I know nothing about the Compound File Binary Format [1][2] (is this the same
as, or a predecessor to, OLE2 containers?), but here is what I found in the file:

CHNKINK record offset = 0x8200
QC Bit offset = 0x8340 - 0x8200 = 0x0140
Annotated contents of data[offset:offset+24]:
          +0    | +2          | +6    | +8    | +10   | +12         | +16         | +20
          recID | thingType   | optA  | optB  | optC  | bitType     | from        | len
00008340  18 00 | 54 4f 4b 4e | 00 00 | 01 00 | 00 00 | 50 4c 43 20 | 32 62 00 00 | 58 00 00 00
data      QCBit | "TOKN"      | false | true  | false | "PLC "      | 0x6232      | 0x58 = 88 bytes


Location    Len Hex Value                Field      Meaning (Little Endian conv, ASCII, hex to dec, etc)
00008200+00 [8] 43 48 4e 4b 49 4e 4b 20             "CHNKINK "
...
00008340+00 [2] 18 00                    QC Bit recID
00008340+02 [4] 54 4f 4b 4e              thingType  "TOKN"
00008340+06 [2] 00 00                    optA       0x0000 -> false
00008340+08 [2] 01 00                    optB       0x0001 -> true
00008340+10 [2] 00 00                    optC       0x0000 -> false
00008340+12 [4] 50 4c 43 20              bitType    "PLC "
00008340+16 [4] 32 62 00 00              data from  0x6232, the byte offset from the beginning of the CHNKINK record at 0x8200
00008340+20 [4] 58 00 00 00              data len   0x58 = 88 bytes
...
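
To make the header layout above easier to check, here is a minimal sketch that decodes those 24 bytes with little-endian reads. The class name QcBitHeaderSketch and its fields are made up for illustration and are not POI classes; treating a non-zero opt value as "true" is an assumption taken from the table above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a POI class): decodes the 24-byte QC Bit header laid out above. */
class QcBitHeaderSketch {
    /** The 24 header bytes found at 0x8340, copied from the dump above. */
    static final byte[] EXAMPLE = {
        0x18, 0x00,                 // recID
        0x54, 0x4f, 0x4b, 0x4e,     // thingType "TOKN"
        0x00, 0x00,                 // optA
        0x01, 0x00,                 // optB
        0x00, 0x00,                 // optC
        0x50, 0x4c, 0x43, 0x20,     // bitType "PLC "
        0x32, 0x62, 0x00, 0x00,     // data from
        0x58, 0x00, 0x00, 0x00      // data len
    };

    final int recId;
    final String thingType;
    final boolean optA, optB, optC;
    final String bitType;
    final long dataFrom;   // byte offset relative to the start of the CHNKINK record
    final long dataLen;    // length of the bit's data in bytes

    QcBitHeaderSketch(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 24).order(ByteOrder.LITTLE_ENDIAN);
        recId = buf.getShort() & 0xFFFF;
        thingType = ascii(buf, 4);
        optA = (buf.getShort() & 0xFFFF) != 0;   // assumption: non-zero means true
        optB = (buf.getShort() & 0xFFFF) != 0;
        optC = (buf.getShort() & 0xFFFF) != 0;
        bitType = ascii(buf, 4);
        dataFrom = buf.getInt() & 0xFFFFFFFFL;
        dataLen = buf.getInt() & 0xFFFFFFFFL;
    }

    /** Reads the next n bytes of the buffer as an ASCII string. */
    private static String ascii(ByteBuffer buf, int n) {
        byte[] tmp = new byte[n];
        buf.get(tmp);
        return new String(tmp, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        QcBitHeaderSketch h = new QcBitHeaderSketch(EXAMPLE, 0);
        long chunkStart = 0x8200;   // CHNKINK record offset from the comment above
        System.out.printf("thingType=%s bitType=%s from=0x%x len=%d absolute=0x%x%n",
                h.thingType, h.bitType, h.dataFrom, h.dataLen, chunkStart + h.dataFrom);
    }
}
```

Adding the decoded "data from" value to the CHNKINK offset gives the absolute position of the bit's data in the file.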
And the raw QCPLCBit record at 0x8200+0x6232=0xe432:
0000e430        03 00 00 00 0c 00  00 00 ff ff 01 00 06 01    |..............|
0000e440  00 00 11 01 00 00 4e 07  00 00 5a 07 00 00 16 00  |......N...Z.....|
0000e450  00 00 00 22 00 06 00 00  01 22 09 00 00 00 02 22  |..."....."....."|
0000e460  07 00 00 00 0a 00 00 00  01 22 0f 00 00 00 0a 00  |........."......|
0000e470  00 00 01 22 0a 00 00 00  0a 00 00 00 00 22 ff ff  |..."........."..|
0000e480  ff ff 04 00 00 00 04 00  00 00                    |..........|

Interpreting the QCPLCBit:
0000e432+0  03 00 00 00   3    number of PLCs
0000e432+4  0c 00 00 00   12   type of PLCs (Type12: holds hyperlinks, complicated)
...
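
As a sanity check on the dump and the two fields interpreted above, the following sketch re-reads the 88 bytes from the hexdump and decodes the first two 32-bit little-endian values. QcPlcBitDumpSketch, fromHex and uint32 are illustrative names, not POI APIs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not a POI class): re-reads the raw QCPLCBit bytes from the hexdump above. */
class QcPlcBitDumpSketch {
    /** The bytes from 0xe432 up to the end of the dump, copied row by row. */
    static final String DUMP =
          "03 00 00 00 0c 00 00 00 ff ff 01 00 06 01 "
        + "00 00 11 01 00 00 4e 07 00 00 5a 07 00 00 16 00 "
        + "00 00 00 22 00 06 00 00 01 22 09 00 00 00 02 22 "
        + "07 00 00 00 0a 00 00 00 01 22 0f 00 00 00 0a 00 "
        + "00 00 01 22 0a 00 00 00 0a 00 00 00 00 22 ff ff "
        + "ff ff 04 00 00 00 04 00 00 00";

    /** Converts whitespace-separated hex byte pairs into a byte array. */
    static byte[] fromHex(String hex) {
        String[] tokens = hex.trim().split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return out;
    }

    /** Reads an unsigned 32-bit little-endian value at the given offset. */
    static long uint32(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] data = fromHex(DUMP);
        System.out.println("length         = " + data.length);
        System.out.println("number of PLCs = " + uint32(data, 0));
        System.out.println("type of PLCs   = " + uint32(data, 4));
    }
}
```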

The QC Bit header specifies the QC PLC Bit record has a length of 88 bytes.
The QCPLCBit specifies it contains 3 hyperlink PLCs (Type 12, 0x0c).
From how I interpret the current code, there's no way that 3 PLC hyperlinks can
be specified in 88 bytes.
> oneStartsAt = 0x4c
> twoStartsAt = 0x68
> threePlusIncrement = 22
Therefore the third PLC probably starts at 0x68+22=0x7e and ends at
0x68+22*2=0x94. With only 0x58=88 bytes, and the second PLC starting at
0x68=104, there aren't even enough bytes for a second PLC, let alone a third.
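
The offset arithmetic above can be written out as a small sketch. The three constants are the values quoted above; the startOf and startsInside helpers, and the reading that every entry after the second is spaced threePlusIncrement bytes apart, are my interpretation and not code taken from POI.

```java
/** Hypothetical helper (not a POI class): checks the Type 12 offsets against the record length. */
class Type12BoundsSketch {
    static final int ONE_STARTS_AT = 0x4c;          // 76
    static final int TWO_STARTS_AT = 0x68;          // 104
    static final int THREE_PLUS_INCREMENT = 22;

    /** Start offset of the n-th hyperlink entry (1-based), under the interpretation above. */
    static int startOf(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        if (n == 1) {
            return ONE_STARTS_AT;
        }
        return TWO_STARTS_AT + (n - 2) * THREE_PLUS_INCREMENT;
    }

    /** True if the n-th entry would at least begin inside a record of the given length. */
    static boolean startsInside(int n, int recordLength) {
        return startOf(n) < recordLength;
    }

    public static void main(String[] args) {
        int recordLength = 0x58;   // 88 bytes, from the QC Bit header
        for (int n = 1; n <= 3; n++) {
            System.out.printf("PLC %d starts at 0x%x, inside record: %b%n",
                    n, startOf(n), startsInside(n, recordLength));
        }
    }
}
```

Only the first entry begins inside the 88-byte record; the last valid array index is 87, which fits with the exception reporting index 88.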

I guess we'd have to consult [MS-CFB] [2] to figure out whether this QCPLCBit
record really can be 88 bytes long, or whether the file is corrupt and Publisher
silently skips over reading these hyperlinks.

[1] https://en.wikipedia.org/wiki/Compound_File_Binary_Format
[2] https://msdn.microsoft.com/en-us/library/dd942138.aspx

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

--- Comment #2 from Javen O'Neal <[hidden email]> ---
The last real change to HPBF hyperlink support was nearly 9 years ago, and
even then the commit message indicated partial hyperlink support. So it's quite
likely that we haven't fully implemented all hyperlink variations.

I expected to see some hyperlink URL as a string in the hexdump, but perhaps
this is a hyperlink to another element within the document.

Nonetheless, there are some nuggets of insight in the comments and variable
names to figure out what's going on in this QC PLC hyperlink bit.
https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hpbf/model/qcbits/QCPLCBit.java?r1=690729&r2=690534

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

Javen O'Neal <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|POI Overall                 |HPBF

--- Comment #3 from Javen O'Neal <[hidden email]> ---
The Microsoft Publisher binary .pub format is undocumented, as indicated here:
https://poi.apache.org/hpbf/index.html

OpenOffice/LibreOffice doesn't have documentation or an open source application
that reads this .pub format, to my knowledge, so that means we'd have to resort
to figuring out the format through lots of hard work.

Assuming the file you have provided is valid (opens without warnings or errors
in Microsoft Publisher), if you're mostly interested in text extraction, then
skipping over this hyperlink is probably preferable over throwing an exception.
We can log the error that we catch and move forward with extraction.
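
A rough sketch of that "log and move on" idea, using only the JDK. LenientBitParsingSketch and parseAll are hypothetical names, the parser is passed in as a plain function, and this is not necessarily how an actual fix in POI would be structured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not a POI class): skips records whose parser runs off the end of the data. */
class LenientBitParsingSketch {
    private static final Logger LOG =
            Logger.getLogger(LenientBitParsingSketch.class.getName());

    /** Parses each record; a record that is too short is logged and skipped, not fatal. */
    static <T> List<T> parseAll(List<byte[]> records, Function<byte[], T> parser) {
        List<T> parsed = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            try {
                parsed.add(parser.apply(records.get(i)));
            } catch (ArrayIndexOutOfBoundsException e) {
                LOG.log(Level.WARNING, "Skipping record " + i + ": shorter than the parser expected", e);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Stand-in parser that needs at least 5 bytes; the second record is too short.
        List<byte[]> records = Arrays.asList(new byte[8], new byte[2]);
        List<Integer> values = parseAll(records, data -> data[4] & 0xFF);
        System.out.println("parsed " + values.size() + " of " + records.size() + " records");
    }
}
```

The trade-off is that the affected hyperlinks are lost, but text extraction from the rest of the document can continue.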

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

--- Comment #4 from Javen O'Neal <[hidden email]> ---
Workaround applied in r1801405. Will be included in POI 3.17 beta 2.

Looking for any volunteers willing to experiment with the .pub format and
extend our documented understanding here:
https://poi.apache.org/hpbf/file-format.html

[Bug 60685] Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

--- Comment #5 from Tim Allison <[hidden email]> ---
Looks like we have ~8500 Publisher files in our regression corpus, if those
would be of any interest. Some are likely truncated... so it goes with Common
Crawl.
