[Bug 59747] New: xlsx file does not conform to bit patterns used by common file type detection software

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=59747

            Bug ID: 59747
           Summary: xlsx file does not conform to bit patterns used by
                    common file type detection software
           Product: POI
           Version: 3.14-FINAL
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSSF
          Assignee: [hidden email]
          Reporter: [hidden email]

Hi,

I'm creating this bug due to a problem we've encountered with POI generated
xlsx files.

Apparently the order of zip entries in xlsx files matters for tools which
determine the file type by matching a byte pattern. See for example Apache Tika
(without the deeper OOXML support library) and Linux's file command.

The OOXML spec and Excel have no problem with POI files, but tools relying on a
certain pattern do.

Here is the output of unzip -l on a POI xlsx file:

Archive:  poi.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
      591  02.06.16 12:40   _rels/.rels
     1063  02.06.16 12:40   [Content_Types].xml
      183  02.06.16 12:40   docProps/app.xml
      437  02.06.16 12:40   docProps/core.xml
      137  02.06.16 12:40   xl/sharedStrings.xml
      818  02.06.16 12:40   xl/styles.xml
      349  02.06.16 12:40   xl/workbook.xml
      569  02.06.16 12:40   xl/_rels/workbook.xml.rels
      670  02.06.16 12:40   xl/worksheets/sheet1.xml
 --------                   -------
     4817                   9 files

And for a native file:

Archive:  excel.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
     1032  01.01.80 00:00   [Content_Types].xml
      588  01.01.80 00:00   _rels/.rels
      557  01.01.80 00:00   xl/_rels/workbook.xml.rels
      906  01.01.80 00:00   xl/workbook.xml
     1542  01.01.80 00:00   xl/styles.xml
     6790  01.01.80 00:00   xl/theme/theme1.xml
     1306  01.01.80 00:00   xl/worksheets/sheet1.xml
      593  01.01.80 00:00   docProps/core.xml
      816  01.01.80 00:00   docProps/app.xml
 --------                   -------
    14130                   9 files

Linux's file command and Tika seem to expect [Content_Types].xml as the
first entry, skip the second, and look for "xl/" in the third entry.
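
For illustration, the entry-order heuristic described above can be sketched
roughly as follows. This is a minimal sketch of the check as described in this
report, not the actual magic rules shipped with file or Tika (those match raw
byte offsets rather than parsed entries). The class and method names are
hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class EntryOrderCheck {

    // Reads entry names in the physical order they appear in the archive.
    static List<String> firstEntryNames(InputStream in, int limit) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while (names.size() < limit && (entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Assumed heuristic: first entry is [Content_Types].xml, the second is
    // ignored, and the third entry name starts with "xl/".
    static boolean looksLikeXlsxByEntryOrder(InputStream in) throws IOException {
        List<String> names = firstEntryNames(in, 3);
        return names.size() == 3
                && names.get(0).equals("[Content_Types].xml")
                && names.get(2).startsWith("xl/");
    }
}
```

With the two listings above, a check like this would accept excel.xlsx and
reject poi.xlsx, because the POI file starts with _rels/.rels and has
docProps/app.xml in third position.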

Would it be possible to fix the order of the entries?

We've written a simple post-processing tool which rewrites the zip file, but
would be happy to have this in POI proper.
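
A post-processing step of this kind could look roughly like the sketch below.
It is not the tool mentioned above, only a minimal illustration under stated
assumptions: all parts fit in memory, the target order is
[Content_Types].xml, then _rels/.rels, then everything under xl/, then the
rest, and timestamps and other entry metadata are not preserved. The class
name, method names and the ranking are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class XlsxEntryReorderer {

    // Assumed target order, modelled on the Excel listing in this report.
    static int rank(String name) {
        if (name.equals("[Content_Types].xml")) return 0;
        if (name.equals("_rels/.rels")) return 1;
        if (name.startsWith("xl/")) return 2;
        return 3;
    }

    // Copies every entry from in to out, changing only the entry order.
    // Closing the ZipOutputStream also closes the underlying output stream.
    static void reorder(InputStream in, OutputStream out) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                parts.put(entry.getName(), buf.toByteArray());
            }
        }

        // List.sort is stable, so entries with equal rank keep their
        // original relative order.
        List<String> names = new ArrayList<>(parts.keySet());
        names.sort(Comparator.comparingInt(XlsxEntryReorderer::rank));

        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (String name : names) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write(parts.get(name));
                zout.closeEntry();
            }
        }
    }
}
```

It would be called with a stream on the POI-generated file as input and a
stream on a new file as output, for example a FileInputStream on poi.xlsx and
a FileOutputStream on a separate target file.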

Thanks and contact me if I can help.

--
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Nick Burch <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #1 from Nick Burch <[hidden email]> ---
Apart from a handful of formats (e.g. those which require an uncompressed
mimetype file as the first entry in the zip), reliably detecting container
formats can only be done by opening up the container itself.

Apache Tika ships with a special detector for zip-based container formats for
this very reason!

(Tika also, on trunk, correctly detects POI-generated OOXML files as OOXML from
mime magic only)
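
As a rough illustration of detection by opening the container rather than
matching bytes at fixed positions, the sketch below scans all entries and
ignores their order. It is not Tika's detector. It assumes, as a
simplification, that a workbook can be recognised by the presence of
[Content_Types].xml and xl/workbook.xml (the names seen in the listings in
this report); a real detector would inspect the content types instead. The
class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ContainerSniffer {

    // Collects every entry name, so the result does not depend on entry order.
    static Set<String> entryNames(InputStream in) throws IOException {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Simplified assumption: these two parts together indicate an xlsx file.
    static boolean containsXlsxParts(InputStream in) throws IOException {
        Set<String> names = entryNames(in);
        return names.contains("[Content_Types].xml")
                && names.contains("xl/workbook.xml");
    }
}
```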

--- Comment #2 from Mark Murphy <[hidden email]> ---
It seems to me that tools relying on a specific file order within an archive
have a design flaw. Apparently Tika does not have that issue, but anything that
does will break if Excel ever changes the order in which it writes files to the
xlsx archive. Excel apparently doesn't care what the order is, so there is no
guarantee the order will remain the same in future versions of the product.

--- Comment #3 from Dominik Mähl <[hidden email]> ---
I agree with both of you. But I'm also convinced that Excel will be (and is)
seen as the reference implementation for OOXML. I can give you the name of at
least one commercial content filtering product which ships with the mentioned
bit patterns.

Also, the change for Tika was committed just yesterday :-)

(https://github.com/apache/tika/commit/52ea9ba7c2e3c99e7a2d4fb38875caa996438857)

To be clear: I know that this approach is flawed, but it seems to me that it is
standard practice, and maybe it is easier to "fix" in POI than in every tool
out there.

If someone would point me to how to do it, I would happily create a patch or
pull request or whatever. It's just that by looking at the POI code I could not
find an easy way to do it.

--- Comment #4 from Javen O'Neal <[hidden email]> ---
Here's a start:
$ grep --recursive --files-with-matches --exclude-dir=".svn" -E \
    "CONTENT_TYPES_PART_NAME|Content_Types|_rels|\.rels|RELATIONSHIP_PART" \
    --include="*.java" src/ooxml/java/org/apache/poi/openxml4j/opc

src/ooxml/java/org/apache/poi/openxml4j/opc/PackageRelationship.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagePartName.java
src/ooxml/java/org/apache/poi/openxml4j/opc/OPCPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/ZipPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipHelper.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagingURIHelper.java

I took a quick glance, and ZipPackage#getPartsImpl and the TreeMap partList
looked potentially relevant, but I couldn't figure out whether this is where the
order is being set. Also, it's possible that the content manager needs to be
created before the rels, which may make it difficult to simply rearrange the
code to get the _rels directory to be created first. It seems more logical to me
for files in higher directories to be created before files in lower
directories.

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Dominik Stadler <[hidden email]> ---
The fix for this was actually quite easy: just swapping the order in which the
two files are written in ZipPackage.saveImpl().

I have done this in r1809357. If it causes issues, we may need to revert it,
though!
