XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Nick Burch-2
Hi All

Over on Stack Overflow <https://stackoverflow.com/q/64269294/685641>
there's a user who was getting what they thought was an embedded XLSX file
out of a PPT, but finding it was an OLE2 wrapper with CompObj and Package
entries. The real XLSX was in the Package part. Passing the outer OLE2
stream to WorkbookFactory didn't work.

What do people think here? Should we have WorkbookFactory spot this case,
grab the OOXML out of the POIFS and try to load that? Update HSLF to
optionally extract the OOXML out of the OLE2? Record the gotcha in the
docs somewhere? Something else?

Cheers
Nick
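
As a rough illustration of what "spotting this case" involves: the outer
stream starts with the OLE2 compound-document signature, while a bare XLSX
starts with the ZIP local-file-header signature. The sketch below uses only
the JDK and checks those leading bytes. ContainerSniffer and Kind are
hypothetical names made up for this example, not POI classes; POI has its
own FileMagic class for this kind of detection, which is not used here.

```java
final class ContainerSniffer {
    /** Hypothetical result type for this sketch; not part of POI. */
    enum Kind { OLE2, ZIP, UNKNOWN }

    // The 8-byte signature at the start of every OLE2 compound document.
    private static final int[] OLE2_MAGIC =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // "PK\3\4", the local file header that starts a ZIP (and so an OOXML) file.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a stream by its leading bytes; 8 bytes are enough. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A result of OLE2 on its own does not say whether the container is a real
XLS or a wrapper like the one above. To tell those apart, the caller would
still have to open the container and look for the Package entry.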

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

kiwiwings
Hi Nick,

 > Should we have WorkbookFactory spot this case, grab the OOXML out of the POIFS and try to load that?

Actually I've updated the factories to handle that case - it might not work ...
We should have an example in our test corpus - Dominik/Tim, can you provide a sample file for .ppt(x) / .xls(x)?

Best wishes,
Andi



Re: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Tim Allison
Does this meet the needs?

https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt

On Sun, Oct 11, 2020 at 5:09 PM Andreas Beeker <[hidden email]> wrote:

> Hi Nick,
>
>  > Should we have WorkbookFactory spot this case, grab the OOXML out of
> the POIFS and try to load that?
>
> Actually I've updated the factories to handle that case - it might not
> work ...
> We should have an example in our test corpus - Dominik/Tim, can you
> provide a sample file for .ppt(x) / .xls(x)?
>
> Best wishes,
> Andi

Re: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

Nick Burch-2
In reply to this post by kiwiwings
On Sun, 11 Oct 2020, Andreas Beeker wrote:
>> Should we have WorkbookFactory spot this case, grab the OOXML out of the
> POIFS and try to load that?
>
> Actually I've updated the factories to handle that case - it might not work
> ...
> We should have an example in our test corpus - Dominik/Tim, can you provide a
> sample file for .ppt(x) / .xls(x)?

Looks like you're right, I'd missed those commits! Support is all there in
XSSFWorkbookFactory and friends.

I've added a unit test for this based on the sample file from Apache Tika.

Thanks
Nick
