[Bug 64322] New: Very poor performance reading OLE2

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug 64322] New: Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

            Bug ID: 64322
           Summary: Very poor performance reading OLE2
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: POIFS
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 37154
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37154&action=edit
Part 1 of 3

Attached file with 3.7MB takes more than 10min to be parsed with POI-4.0.1.
Profiling with jvisualvm, most time is spent in IOUtils.readFully() (360s),
FileBackedDataSource.size() (248s) and FileBackedDataSource.read() (88s). Code
to reproduce below:

    public static void main(String[] args) {

        File file = new File("F:\\teste-files\\thumbs.db\\New
Folder\\Thumbs.db");
        try (POIFSFileSystem poiFS = new POIFSFileSystem(file)) {
            recurseDir(poiFS.getRoot());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void recurseDir(DirectoryEntry dir) {
        for (Entry entry : dir) {
            try {
                if (entry instanceof DirectoryEntry) {
                    recurseDir((DirectoryEntry) entry);
                }
                if (entry instanceof DocumentEntry) {
                    DocumentEntry de = (DocumentEntry) entry;
                    HashMap<String, String> props = new HashMap<>();
                    try (DocumentInputStream dis = new DocumentInputStream(de))
{
                        props.put("name", de.getName());

                        if (PropertySet.isPropertySetStream(dis)) {
                            dis.mark(10000000);
                            PropertySet ps = null;
                            try {
                                ps = new PropertySet(dis);

                            } catch (UnsupportedEncodingException e) {
                                // ignore
                            }
                            if (ps != null) {
                                for (Section section : ps.getSections()) {
                                    for (Property p : section.getProperties())
{
                                        String prop = section.getDictionary()
!= null
                                                ?
section.getDictionary().get(p.getID())
                                                : String.valueOf(p.getID());
                                        if (p.getValue() != null)
                                            props.put("property_" + prop,
p.getValue().toString());
                                    }
                                }
                            }
                            dis.reset();
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

        }

    }

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

--- Comment #1 from Luis Filipe Nassif <[hidden email]> ---
Created attachment 37155
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37155&action=edit
Part 2 of 3

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

--- Comment #2 from Luis Filipe Nassif <[hidden email]> ---
Created attachment 37156
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37156&action=edit
Part 3 of 3

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

Luis Filipe Nassif <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #3 from Luis Filipe Nassif <[hidden email]> ---
Needed to divide file into 3 7z segments because of file size upload limit.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #4 from Dominik Stadler <[hidden email]> ---
For me the code runs in 16seconds, what king of hardware/memory/cpu/disk are
you using here to let us see where the huge difference comes from?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Dominik Stadler <[hidden email]> ---
I could optimize this somewhat by some internal changes in how the file-channel
is handled, see changes done via r1877816.

However you should be able to make a large improvement by opening the file via
InputStream and not via java.io.File. This way Apache POI will use more memory,
but process the file much quicker.

For me runtime of the test went from 5 seconds to 200ms when using the other
constructor!

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

--- Comment #6 from Luis Filipe Nassif <[hidden email]> ---
Thank you, Dominik, for taking a look and for the patch! Unfortunately, I am
not able to reproduce the huge slowness anymore, probably because the previous
used disk was a hardware RAID0 in a marvel chipset, but I lost it and don't
have it anymore (fortunately I recovered the data).

Now the test takes 15-20 seconds to finish, with different disks in my machine.
And using the InputStream constructor, it ends in 300ms! Thank you very much
for pointing it!

But as I understood from POI code, the entire file is loaded in heap, right?
Sometimes I have large files to process, will check file size first before
calling the other constructor.

Thanks

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 64322] Very poor performance reading OLE2

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64322

--- Comment #7 from Nick Burch <[hidden email]> ---
FYI for people here - a few years ago we were talking about having three
different open modes. Read-only, Read for write elsewhere (copy-on-write) and
Read Write (in-place write). If we implemented that, at least some modes (and
certainly read-only) could be optimised quite a bit. IIRC writing when growing
a file size would also be able to get a big boost

Sadly, I couldn't convince $DAYJOB of the benefits to fund the work, and no-one
else volunteered, so nothing happened. See
https://bz.apache.org/bugzilla/show_bug.cgi?id=60741 and threads on the dev
list at around that time for more

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]