[Bug 64015] New: Avoid BitSet at least in xslf

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[Bug 64015] New: Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

            Bug ID: 64015
           Summary: Avoid BitSet at least in xslf
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSLF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

On https://issues.apache.org/jira/browse/TIKA-3017, a user shared a 19kb pptx
file that requires 2GB to parse or else throws an OOM.  The problem appears to
be that BitSet is horribly memory inefficient on large integers and there are
some shape ids with very large numbers.  Let's switch this to a simple HashSet
and look for other BitSet oom vulnerabilities throughout the codebase.

Executor task launch worker for task 47360
 at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
 at java.util.Arrays.copyOf([JI)[J (Arrays.java:3308)
 at java.util.BitSet.ensureCapacity(I)V (BitSet.java:337)
 at java.util.BitSet.expandTo(I)V (BitSet.java:352)
 at java.util.BitSet.set(I)V (BitSet.java:447)
 at org.apache.poi.xslf.usermodel.XSLFSheet.registerShapeId(I)V
(XSLFSheet.java:123)
 at org.apache.poi.xslf.usermodel.XSLFDrawing.<init>

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #1 from Andreas Beeker <[hidden email]> ---
Please don't use HashSet and search for a sparse BitSet class instead. The
reason for BitSet is/was an efficient way of searching for the next available
id. looping through and testing for an id is not the way to go ...

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #2 from Tim Allison <[hidden email]> ---
Sorry. I misunderstood.  Thank you.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #3 from Tim Allison <[hidden email]> ---
Turns out you can trigger an oom with -Xmx900m with just three integer values:

BitSet bs = new BitSet();
bs.set(1794795584);
bs.set(1270275541);
bs.set(2127760244);

Any objections to: https://github.com/brettwooldridge/SparseBitSet

I've run a handful of speed/memory/accuracy tests and it _seems_ far better
than Java's BitSet.

If there are no objections, should we copy/paste the class or import another
dependency?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #4 from Andreas Beeker <[hidden email]> ---
+1 ... I wanted to point out the same ... well it's one of the first google
matches anyway when searching for sparse bitset

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #5 from Tim Allison <[hidden email]> ---
What's your preference for adding a dependency vs copy/pasting one large class.

The benefit to copy/paste is that we can strip out object serialization, which
I now detest.  And, right, those are only private methods, but still...

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #6 from Dominik Stadler <[hidden email]> ---
Copying means duplicating code and is usually not done unless the code is
"submitted" to the project. It would just hide the fact that use code from
other projects.

Using a dependency means we can more easily update to newer versions, so I
would prefer that.

Hopefully we can switch to a modern dependency management at some point to get
rid of the current changes in the build.xml for updated dependencies.

On the subject: I had good experiences with https://github.com/lemire/javaewah
in another project. This one also seems to be at least more actively maintained
than the other one and used by many more projects on GitHub, 316 vs. 16.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #7 from Tim Allison <[hidden email]> ---
Sounds good.  If we're adding a dependency, then, y, I'm happy to go with
javaewah.  Let me kick the same tires on that one.

The single class in wooldridge's was appealing if we were to copy/paste.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #8 from Andreas Beeker <[hidden email]> ---
-1 to use javaewah. I have only read the intro/readme.md about compressed
bitmaps, which says that the sparse scenario is not a use case for compressed
bitmaps.
I don't know how many shape ids would be used at maximum, but I guess something
like 20-30k should be the maximum goal.

The SparseBitSet is ASL licensed so I don't see a problem importing it into our
source - although I'm a bit puzzled how lines 4-5 of the source should be
treated ... and usually ASL licensed code should have the ASL header on top ...

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #9 from Konstantin Gribov <[hidden email]> ---
I'd suggest to look at Lemire's RoaringBitmap
(https://github.com/RoaringBitmap/RoaringBitmap), it's a bit newer and it works
fine with both sparse and dense bitmaps.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #10 from Tim Allison <[hidden email]> ---
I ran two tests of roaring bitmap, java bitset and zaxxers sparsebitset.

I wanted to simulate "worst case scenario" which is random ints from 0 to
Integer.MAX_Value.

First test: 1000000 random numbers (well above anything we'd think we'd need)

roaring bitmap (tested with -Xmx128m)
48018 total ms, 539.5280898876405 avg ms

java bitset (tested with -Xmx2g)
23826 total ms, 267.70786516853934 avg ms

zaxxer (requires -Xmx256m)
27730 total ms, 311.5730337078652 avg ms


100000 random numbers (still a good deal more that we think we'd need)

roaring bitmap (tested with -Xmx128m)
6508 total ms, 73.12359550561797 avg ms

java bitset (tested with -Xmx2g)
7607 total ms, 85.47191011235955 avg ms

zaxxer (tested with -Xmx128m)
3353 total ms, 37.674157303370784 avg ms

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #11 from Tim Allison <[hidden email]> ---
I'm wary of microbenchmarks (especially when I write them), and these results
could be misleading.  I did include a warmup phase at.

At the very least, they show memory consumption is awful with Java, much better
with Zaxxer and still better with roaring.

Zaxxer is the fastest, with roaring doing better than Java at 100k, which is
well above what we think we'd need.

roaring does look to be better supported/more recent/more active community;
zaxxer has an interface that is closer to BitSet and would be an easier drop in
place kind of thing.  I did get an "IllegalArgumentException" with zaxxer on
one run, which gives me pause...

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #12 from Tim Allison <[hidden email]> ---
Hackery of code:
        Random r = new Random();
        int sz = 100000;
        int iterations = 100;
        long absents = 0;
        long gets = 0;
        long totalElapsed = 0;
        int warmup = 10;
        int observations = 0;
        for (int it = 0; it < iterations; it++) {
            long start = System.currentTimeMillis();
            System.out.println("iteration " + it);
            int[] ints = new int[sz];
            for (int i = 0; i < sz; i++) {
                ints[i] = Math.abs(r.nextInt());
            }
     //       BitSet bs = new BitSet();
            RoaringBitmap rb = new RoaringBitmap();
            for (int i = 0; i < sz; i++) {
                /*bs.set(ints[i]);
                long absent = bs.nextClearBit(ints[i]);
                absents += absent;
                if (bs.get(Math.abs(r.nextInt()))) {
                    gets++;
                }*/

                rb.add(ints[i]);
                long absent = rb.nextAbsentValue(ints[i]);
                absents += absent;
                if (rb.contains(Math.abs(r.nextInt()))) {
                    gets++;
                }
            }
            if (it > warmup) {
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(absents + " " + gets + " " + elapsed);
                totalElapsed += elapsed;
                observations++;
            }
        }
        System.out.println(totalElapsed + " total ms, " +
                (double) totalElapsed / (double) observations + " avg ms");

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #13 from Tim Allison <[hidden email]> ---
Sorry.  The exception I got with Zaxxer was when I tried to get an
Integer.MAX_VALUE:

Exception in thread "main" java.lang.IndexOutOfBoundsException: i=2147483647

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #14 from Tim Allison <[hidden email]> ---
And ewah is far, far, far slower...I didn't even wait for completion.

Times on 100000 random integers:
iteration 0 19198ms
iteration 1 18822ms
iteration 2 19034ms

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #15 from PJ Fanning <[hidden email]> ---
could we implement a wrapping API and allow people to plugin their own
implementations (providing a reference implementation that is the default)?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #16 from Andreas Beeker <[hidden email]> ---
(In reply to PJ Fanning from comment #15)
> could we implement a wrapping API and allow people to plugin their own
> implementations (providing a reference implementation that is the default)?

I hope we can achieve this without adding another ThreadLocal.
With all those tweaks we provide elsewhere, I think it's time to think about
some kind of POI context/configuration class, which can be optionally provided
on the top level usermodel classes.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #17 from Tim Allison <[hidden email]> ---
>I hope we can achieve this without adding another ThreadLocal.
+1

I propose picking roaring or zaxxer and just running with it for now.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Avoid BitSet at least in xslf

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

--- Comment #18 from Tim Allison <[hidden email]> ---
Created attachment 36927
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=36927&action=edit
PATCH

I'm attaching a first draft of a patch with zaxxer.  I'm not going to commit it
and head off on holiday. :)

Longer term, Robert Muir recommended Lucene's SparseBitSet, and it looks good.
I've opened https://issues.apache.org/jira/browse/COLLECTIONS-743 and will
submit a patch there.  If we can get it into collections, we can drop the
zaxxer dependency.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64015] Replace usage of Java BitSet with a memory-optimized/sparse implementation

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64015

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Avoid BitSet at least in    |Replace usage of Java
                   |xslf                        |BitSet with a
                   |                            |memory-optimized/sparse
                   |                            |implementation

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12