[Bug 61677] New: ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug 61677] New: ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

            Bug ID: 61677
           Summary: ChainLoopDetector failing with
                    java.lang.ArrayIndexOutOfBoundsException: <negative
                    number>
           Product: POI
           Version: 3.17-FINAL
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: POIFS
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

I have an issue with a particular CFBF MSG file (unfortunately, due to
non-disclosure and privacy issues, I can't show this file), I am running into
an issue, where NPOIFSFileSytsem's/BlockStore's chain loop detector is failing
with this kind of error:

```
java.lang.ArrayIndexOutOfBoundsException: -1557822031
        at
org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99)
        at
org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:441)
```

Upon inspection of the code, my understanding is that what happens here is that
since int is signed (and in fact represents sector number within a valid range
of 0x00000000 to 0xFFFFFFF9), the algorithm used in ChainLoopDetector is
inherently flawed as it assumes that `offset` is always non-negative. However,
if you look into `NPOIFSFileSystem.readCoreContents` I can see that the same
value is also compared with the END_OF_CHAIN constant (reserved range), which
means that the whole range from 0x00000000 to 0xFFFFFFF9 might be passed over
to `ChainLoopDetector#claim`.

Can anybody confirm if my hunch is correct? I'll gladly send a patch for the
chain loop detector that'd work with `int` correctly.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #1 from Tim Allison <[hidden email]> ---
Created attachment 35470
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35470&action=edit
triggering file from govdocs1

The attached file from govdocs1 shows a similar stacktrace.  We have on the
order of 100-200 ish documents in Tika's public regression set that trigger a
similar trace.


java.lang.ArrayIndexOutOfBoundsException: -65537
    at
org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99)
    at
org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:168)
    at
org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142)
    at
org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:81)
    at
org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66)
    at
org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:440)
    at
org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:235)
    at
org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:168)
    at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:122)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
    at
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
    at
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #2 from Tim Allison <[hidden email]> ---
I'm not extremely familiar with that chunk of code, but your hunch sounds
right.  Thank you very much for digging through our code to find the cause.

Perhaps Nick might have an idea?

Please do submit a patch and feel free to use the attached document for your
unit test.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #3 from Tim Allison <[hidden email]> ---
The earlier file I attached and nearly all of the other files that I've
manually reviewed fail to open in MSWord.  This ppt[1] does open in MSPPT, but
there are some issues with the images towards the end.

Are you able to open your triggering file in Outlook?

[1] http://162.242.228.174/docs/govdocs1/679/679046.ppt

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #4 from Tim Allison <[hidden email]> ---
Created attachment 35471
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35471&action=edit
list of files with similar exception in Tika's regression corpus

prefix with http://162.242.228.174/docs  to download.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #5 from Javen O'Neal <[hidden email]> ---
Side-note: is anyone familiar with any Java developer tools that can analyze
source or bytecode to find where an infinite loop could occur? If such a tool
existed, would it be valuable to run this against our codebase as part of our
test suite?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #6 from Nick Burch <[hidden email]> ---
(In reply to Javen O'Neal from comment #5)
> Side-note: is anyone familiar with any Java developer tools that can analyze
> source or bytecode to find where an infinite loop could occur? If such a
> tool existed, would it be valuable to run this against our codebase as part
> of our test suite?

Semmle / LGTM might be able to do this. There's a free version for open source
projects - https://lgtm.com/

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #7 from Yurii Rashkovskii <[hidden email]> ---
Tim,

Thanks a lot for supplying additional files. Indeed, it does seem like the doc
files can't be opened by Word, so the point is moot there.

The PPT file you attached doesn't seem to fail ChainLoopDetector at all.
However, when I run it on my (unfortunately, undisclosable) file, it does trip
ChainLoopDetector#claim with a negative index.

However, once I replaced the boolean[] array with a HashSet<Integer> (to
prevent the negative indices problem), I ran into a new issue. It looks like
the problem is deeper than I originally thought.

At this point it trips FileBackedDataSource and its use of
FileChannel#position. Both expect longs, but since some of the legitimate
offsets NPOIFSFileSystem#getBlockAt() receives are interpreted as negatives
(since, again, int is signed, and the actual numbers in the format are unsigned
integers) it ends up passing a negative offset to FileBackedDataSource, which,
of course, fails.

The most reasonable solution, it seems to me, is to rectify the problem at
source. Using a signed integer where an unsigned one is expected is asking for
trouble (and getting it).

As this project should work on Java 1.6+, we can't use Java 8's
Integer.toUnsignedLong, but we could use a very simple helper function for
that.

Following through, I tried to convert the signed ints to unsigned ints (within
longs), only to get to further problem of FileBackedDataSource not be able to
read any bytes at the given offset (returning its own
IndexOutOfBoundsException).

So at this point it feels to me as there are two potential reasons for this:

1) The file is corrupt but somehow handled fine by Microsoft Office.
2) POIFS' understanding of the format (could it be a compatibility issue of
some kind?) is somehow wrong and the offsets read are not what they should be.

Thoughts?

---

Another issue is that it looks like the only file that seems to be working
perfectly well for Microsoft Office AND using the full range of unsigned int32
values for offsets is just that MSG file I can't disclose. I didn't check ALL
of your doc files, but the first sample I had all failed to be opened in
Office. With that in mind, if we can't find any other file like this, what
would be our course of action?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

--- Comment #8 from Yurii Rashkovskii <[hidden email]> ---
I investigated this issue further. It looks like Outlook is able to render the
email, one of the attachment is actually broken (too short). To me it feels
like Outlook (or whatever library they use for working with CFBF) is rather
lazy (won't read everything in upon initialization) and POIFS is effectively
more eager.

Do you think there is any truth to this and possible workarounds? I'd rather
have a portion of that email recovered than have nothing at all (effectively,
minic Outlook's behaviour)

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61677] ChainLoopDetector failing with java.lang.ArrayIndexOutOfBoundsException: <negative number>

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61677

Nick Burch <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #9 from Nick Burch <[hidden email]> ---
We're moving to Java 8 for the next major release, so using
Integer.toUnsignedLong might be possible going forward. However...

I think there are some hard-baked assumptions in the code that no valid block
will be beyond 2^31 mark. If that were to change, because of the way that Java
arrays work (and we generally want to use Arrays not Maps/Sets for performance
and memory utilisation), we'd have to change some places to potentially have a
second array for the "negative signed ints" offsets

For your problematic MSG file, how big is the file itself? And are these
negative offsets occurring in the Mini Stream (small blocks) or the Normal /
Main Stream (big blocks)? Can other tools (eg OpenOffice, libgsf) read the file
ok?

The spec <https://msdn.microsoft.com/en-us/library/dd942475.aspx> suggests that
the maximum allowed size of a 512-blocksize file (eg MSG or DOC) is 2gb, so we
shouldn't be anywhere near exceeding the 2^31 block index limit on big blocks.
Even a mini stream (64 byte blocks for a 512 bigblock file) shouldn't, by my
calculation, be able to exceed the 2^31 limit whilst still staying under 2gb
for the overall filesize.

I *think* that only 4096 byte block files should be able to get into the
0x80000000+ sector number range. Even if they did that with just the
ministream, that's still a 128gb+ test file to trigger it, unless my maths is
off!

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]