streaming detection of OLE?

Tim Allison
All,
  In Tika, when we do file type detection of OLE files
(POIFSContainerDetector), we spool the file to disk, open a POIFS and
make a decision based on document/directory names.  A user on
TIKA-2849 does not want to copy the full file from a slow network
drive for detection.  When I tried using a BoundedInputStream with
POIFS, not surprisingly, I got EOF exceptions.
  Question: is there any way to do detection in a streaming mode for
OLE files?  Or, is this the best we can do?  Thank you!

       Best,

                     Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: streaming detection of OLE?

Dave Fisher-5
Hi Tim,

Maybe the answer is using HPSF -

https://poi.apache.org/components/hpsf/how-to.html

Regards,
Dave

> On Apr 16, 2019, at 11:47 AM, Tim Allison <[hidden email]> wrote:
>
> All,
>  In Tika, when we do file type detection of OLE files
> (POIFSContainerDetector), we spool the file to disk, open a POIFS and
> make a decision based on document/directory names.  A user on
> TIKA-2849 does not want to copy the full file from a slow network
> drive for detection.  When I tried using a BoundedInputStream with
> POIFS, not surprisingly, I got EOF exceptions.
>  Question: is there any way to do detection in a streaming mode for
> OLE files?  Or, is this the best we can do?  Thank you!
>
>       Best,
>
>                     Tim

Re: streaming detection of OLE?

Tim Allison
Thank you, Dave!  The reading examples use POIFSReader, which I had hoped
was truly streaming, but it creates a POIFS, which requires a read/skip of
the entire stream IIUC, and then iterates. Or am I missing something?

I didn’t try POIFSReader by specifying a subdoc to process, but it looks
like it opens a POIFS first no matter how you register a listener.

On Tue, Apr 16, 2019 at 3:20 PM Dave Fisher <[hidden email]> wrote:

> Hi Tim,
>
> Maybe the answer is using HPSF -
>
> https://poi.apache.org/components/hpsf/how-to.html
>
> Regards,
> Dave

Re: streaming detection of OLE?

kiwiwings
In reply to this post by Tim Allison
Hi Tim,

When you say detection, do you mean the type of OLE document, e.g. word, excel, powerpoint ...?

Instead of reading the directory entries, you could also read the storage_clsid of the root record.
But I haven't yet checked where this information is located in the file.

Andi
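
A minimal sketch of the storage_clsid idea, assuming the layout in Microsoft's [MS-CFB] specification: the root record is the first 128-byte entry of the first directory sector, and its CLSID occupies 16 bytes starting at byte 80 of that entry. The class name RootClsid is hypothetical and is not part of POI or Tika. The root CLSID may be all zeros in files from some producers, so this is probably a shortcut rather than a replacement for the name-based check.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not a POI or Tika class. */
final class RootClsid {
    static final int ENTRY_SIZE = 128;  // size of one directory entry
    static final int CLSID_OFFSET = 80; // CLSID field inside the entry

    /** Formats the CLSID stored in a 128-byte directory entry as a GUID string. */
    static String clsidOf(byte[] entry) {
        if (entry == null || entry.length < ENTRY_SIZE) {
            throw new IllegalArgumentException("need a full 128-byte directory entry");
        }
        // The first three GUID fields are little-endian; the last 8 bytes are kept in file order.
        ByteBuffer b = ByteBuffer.wrap(entry, CLSID_OFFSET, 16).order(ByteOrder.LITTLE_ENDIAN);
        int data1 = b.getInt();
        int data2 = b.getShort() & 0xFFFF;
        int data3 = b.getShort() & 0xFFFF;
        StringBuilder sb = new StringBuilder(String.format("%08X-%04X-%04X-", data1, data2, data3));
        for (int i = 0; i < 8; i++) {
            sb.append(String.format("%02X", b.get() & 0xFF));
            if (i == 1) {
                sb.append('-'); // separator after the first two trailing bytes
            }
        }
        return sb.toString();
    }
}
```

For a Word 97-2003 document the result would be 00020906-0000-0000-C000-000000000046; a detector could compare the string against a small table of known CLSIDs and fall back to the directory names when there is no match.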




Re: streaming detection of OLE?

Dave Fisher-5
In reply to this post by Tim Allison
Hi -

Well, it’s early POI stuff. Maybe a patch is possible for the narrow use case the Tika user has.

I assume that all you need is the first block or two to confirm this looks like an OLE document.

Regards,
Dave
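
As a rough illustration of the "first block" check: every OLE2 compound file starts with the same 8-byte signature, so the first few bytes of the stream are enough to confirm the container format. The sketch below assumes the caller has already read the start of the file into a byte array; Ole2Magic is a hypothetical name, not a POI or Tika class. It only confirms that the file is OLE2 and says nothing about whether it holds a Word, Excel or other document.

```java
/** Hypothetical helper for illustration; not a POI or Tika class. */
final class Ole2Magic {
    // The 8-byte signature at offset 0 of every OLE2 compound file.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given leading bytes start with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] head) {
        if (head == null || head.length < SIGNATURE.length) {
            return false; // not enough data to decide
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (head[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```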

> On Apr 16, 2019, at 12:29 PM, Tim Allison <[hidden email]> wrote:
>
> Thank you, Dave!  The reading examples use POIFSReader, which I had hoped
> was truly streaming, but it creates a POIFS, which requires a read/skip of
> the entire stream IIUC, and then iterates...Or, am I missing something?
>
> I didn’t try POIFSReader by specifying a subdoc to process, but it looks
> like it opens a POIFS first no matter how you register a listener.

Re: streaming detection of OLE?

Tim Allison
In reply to this post by kiwiwings
>e.g. word, excel, powerpoint ...
Yes, and a half dozen or so other non-MS OLE file types, such as StarOffice.

>clsid

Will take a look. Thank you!

On Tue, Apr 16, 2019 at 3:35 PM Andreas Beeker <[hidden email]> wrote:

> Hi Tim,
>
> when you say detection, you mean the type of OLE document? , e.g. word,
> excel, powerpoint ...
>
> Instead of reading the directory entries, you could also read the
> storage_clsid of the root record.
> But I haven't checked yet, where this information is located in the file.
>
> Andi

Re: streaming detection of OLE?

Tim Allison
In reply to this post by Dave Fisher-5
>Maybe a patch is possible for the narrow use case the Tika user has

Yes. I will take a closer look at HPSF and POIFS. This definitely belongs in POI.
Thank you!

On Tue, Apr 16, 2019 at 3:45 PM Dave Fisher <[hidden email]> wrote:

> Hi -
>
> Well it’s early POI stuff. Maybe a patch is possible for the narrow use
> case the Tika user has.
>
> I assume that all you need is the first block or two to confirm this looks
> like an OLE document.
>
> Regards,
> Dave

Re: streaming detection of OLE?

kiwiwings
In reply to this post by Tim Allison
In some files the property records are located at the end of the file.
If you read the POIFS HeaderBlock (see the POIFSFileSystem constructor), HeaderBlock._property_start tells you the first block containing properties.
When accessing the network file via HTTP/WebDAV, you can then jump to that range [1].

This is all just an idea. We provide the helper functionality via HeaderBlock and PropertySet, but I don't think we should put the network range logic into POI.

Andi

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
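
The jump described above could be sketched roughly as follows, working on the raw header bytes rather than POI's HeaderBlock. The offsets are assumptions taken from the [MS-CFB] layout: the sector shift is a 16-bit value at byte 30, and the first directory sector (the value POI keeps in _property_start, as far as I can tell) is a 32-bit value at byte 48. DirectoryRange is a hypothetical name. The directory can continue in further sectors chained through the FAT, which this sketch does not follow.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not a POI or Tika class. */
final class DirectoryRange {

    /** Sector size in bytes, taken from the sector shift in the header. */
    static int sectorSize(byte[] header) {
        int shift = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getShort(30) & 0xFFFF;
        if (shift != 9 && shift != 12) { // 512-byte or 4096-byte sectors
            throw new IllegalArgumentException("unexpected sector shift: " + shift);
        }
        return 1 << shift;
    }

    /** Byte offset in the file of the first directory sector. */
    static long firstDirectoryOffset(byte[] header) {
        long sector = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt(48) & 0xFFFFFFFFL;
        // Sector 0 begins after the header sector, hence the +1.
        return (sector + 1) * sectorSize(header);
    }

    /** Value for an HTTP Range header that covers exactly that one sector. */
    static String rangeHeaderValue(byte[] header) {
        long start = firstDirectoryOffset(header);
        return "bytes=" + start + "-" + (start + sectorSize(header) - 1);
    }
}
```

A client would first fetch the opening 512 bytes, pass them to rangeHeaderValue, and send the result as the Range header of a second request; the root entry is then the first 128 bytes of the response body.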


