Test document Tika-792

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Test document Tika-792

Mark Murphy
I am working on Bug 61787 where documents containing rsidDel=000000 are not
extracting the correct text. The issue is that rsidxxx attributes are just
there to indicate which revision a particular change belongs to, but does
not necessarily indicate that a particular revision actually occurred. Bug
58067 corrected an issue where deleted text was being returned by the
XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
keying on the rsidDel attribute rather than the delText tag which
specifically means that this is the deleted text.

So I corrected this in XWPFParagraph and XWPFRun. Now a test on document
Tika-792 is failing because it is expecting getText to return deleted text.
So what do you want to do?

The options as I see them are

   1. To allow getText() to return all text, even deleted text, than add
   another method to only return undeleted text.
   2. Change getText() to return undeleted text, and add another method to
   retrieve all text.

I prefer the second option, and I suspect that the Tika test is not
particularly valid as it's comment is that it's purpose is to include
CTBookmark classes in ooxmlLite. Tim, do you have a preference here as my
change will likely affect you the most.
Reply | Threaded
Open this post in threaded view
|

Re: Test document Tika-792

kiwiwings
There's also a third (and probably more ...) option:
You could set an option/enum/boolean as a ThreadLocal to decide how to output the data.
The default might be not to return deleted text and Tika could change it when calling the extractor.

As the library would fill up with ThreadLocals, I wonder if we could have a central configuration
ThreadLocal (or similar) instance?

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:

> I am working on Bug 61787 where documents containing rsidDel=000000 are not
> extracting the correct text. The issue is that rsidxxx attributes are just
> there to indicate which revision a particular change belongs to, but does
> not necessarily indicate that a particular revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag which
> specifically means that this is the deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. To allow getText() to return all text, even deleted text, than add
>    another method to only return undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid as it's comment is that it's purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here as my
> change will likely affect you the most.
>


signature.asc (495 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: Test document Tika-792

Allison, Timothy B.
I'd prefer to avoid ThreadLocal if possible.

Could we add an enum for type of run?  Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available?

In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's getText().

-----Original Message-----
From: Andreas Beeker [mailto:[hidden email]]
Sent: Sunday, December 3, 2017 4:15 PM
To: [hidden email]
Subject: Re: Test document Tika-792

There's also a third (and probably more ...) option:
You could set an option/enum/boolean as a ThreadLocal to decide how to output the data.
The default might be not to return deleted text and Tika could change it when calling the extractor.

As the library would fill up with ThreadLocals, I wonder if we could have a central configuration ThreadLocal (or similar) instance?

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:

> I am working on Bug 61787 where documents containing rsidDel=000000
> are not extracting the correct text. The issue is that rsidxxx
> attributes are just there to indicate which revision a particular
> change belongs to, but does not necessarily indicate that a particular
> revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag which
> specifically means that this is the deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on
> document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. To allow getText() to return all text, even deleted text, than add
>    another method to only return undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid as it's comment is that it's purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here as
> my change will likely affect you the most.
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Test document Tika-792

Murphy, Mark
My other thought was to add a method with a signature like getText(boolean includeDeleted). Calling without the boolean would simply make the call to the new method passing false. That would change behavior for existing code to omit deleted text, but in my mind, the typical use case would only want active text. There is the Tika test, and another test that checks return of deleted text which would need to be adjusted. I would want to determine how HWPF works with deleted text.

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Monday, December 04, 2017 10:43 AM
To: POI Developers List <[hidden email]>
Subject: RE: Test document Tika-792

I'd prefer to avoid ThreadLocal if possible.

Could we add an enum for type of run?  Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available?

In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's getText().

-----Original Message-----
From: Andreas Beeker [mailto:[hidden email]]
Sent: Sunday, December 3, 2017 4:15 PM
To: [hidden email]
Subject: Re: Test document Tika-792

There's also a third (and probably more ...) option:
You could set an option/enum/boolean as a ThreadLocal to decide how to output the data.
The default might be not to return deleted text and Tika could change it when calling the extractor.

As the library would fill up with ThreadLocals, I wonder if we could have a central configuration ThreadLocal (or similar) instance?

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:

> I am working on Bug 61787 where documents containing rsidDel=000000
> are not extracting the correct text. The issue is that rsidxxx
> attributes are just there to indicate which revision a particular
> change belongs to, but does not necessarily indicate that a particular
> revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag which
> specifically means that this is the deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on
> document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. To allow getText() to return all text, even deleted text, than add
>    another method to only return undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid as it's comment is that it's purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here as
> my change will likely affect you the most.
>


B KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB  [  X  ܚX KK[XZ[
 ] ][  X  ܚX P K \X K ܙ B  ܈Y][ۘ[  [X[  K[XZ[
 ] Z[ K \X K ܙ B B

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Test document Tika-792

Allison, Timothy B.
getText(includeDeleted) is on the right track, I think.  Given that a run can have ruby text and/or be a formula and/or be any one of the many things on 4139-4140, should we be more specific?  getDeletedText(), getRuby(), etc...

-----Original Message-----
From: Murphy, Mark [mailto:[hidden email]]
Sent: Monday, December 4, 2017 1:10 PM
To: 'POI Developers List' <[hidden email]>
Subject: RE: Test document Tika-792

My other thought was to add a method with a signature like getText(boolean includeDeleted). Calling without the boolean would simply make the call to the new method passing false. That would change behavior for existing code to omit deleted text, but in my mind, the typical use case would only want active text. There is the Tika test, and another test that checks return of deleted text which would need to be adjusted. I would want to determine how HWPF works with deleted text.

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Monday, December 04, 2017 10:43 AM
To: POI Developers List <[hidden email]>
Subject: RE: Test document Tika-792

I'd prefer to avoid ThreadLocal if possible.

Could we add an enum for type of run?  Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available?

In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's getText().

-----Original Message-----
From: Andreas Beeker [mailto:[hidden email]]
Sent: Sunday, December 3, 2017 4:15 PM
To: [hidden email]
Subject: Re: Test document Tika-792

There's also a third (and probably more ...) option:
You could set an option/enum/boolean as a ThreadLocal to decide how to output the data.
The default might be not to return deleted text and Tika could change it when calling the extractor.

As the library would fill up with ThreadLocals, I wonder if we could have a central configuration ThreadLocal (or similar) instance?

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:

> I am working on Bug 61787 where documents containing rsidDel=000000
> are not extracting the correct text. The issue is that rsidxxx
> attributes are just there to indicate which revision a particular
> change belongs to, but does not necessarily indicate that a particular
> revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag which
> specifically means that this is the deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on
> document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. To allow getText() to return all text, even deleted text, than add
>    another method to only return undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid as it's comment is that it's purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here as
> my change will likely affect you the most.
>


B KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB  [  X  ܚX KK[XZ[  ] ][  X  ܚX P K \X K ܙ B  ܈Y][ۘ[  [X[  K[XZ[  ] Z[ K \X K ܙ B B B KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB  [  X  ܚX KK[XZ[
 ] ][  X  ܚX P K \X K ܙ B  ܈Y][ۘ[  [X[  K[XZ[
 ] Z[ K \X K ܙ B B

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Test document Tika-792

Murphy, Mark
I thought about that, but then how does the user know when deleted text or a Ruby exists. As it is, getText retrieves Ruby text. Or if we just retrieve Deleted text, how does the user know where it came from? My thought  concerning getText(includeDeleted) is that all the text including deleted text would be returned, of course you wouldn't know which text is deleted in that case.

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Tuesday, December 05, 2017 8:37 AM
To: POI Developers List <[hidden email]>
Subject: RE: Test document Tika-792

getText(includeDeleted) is on the right track, I think.  Given that a run can have ruby text and/or be a formula and/or be any one of the many things on 4139-4140, should we be more specific?  getDeletedText(), getRuby(), etc...

-----Original Message-----
From: Murphy, Mark [mailto:[hidden email]]
Sent: Monday, December 4, 2017 1:10 PM
To: 'POI Developers List' <[hidden email]>
Subject: RE: Test document Tika-792

My other thought was to add a method with a signature like getText(boolean includeDeleted). Calling without the boolean would simply make the call to the new method passing false. That would change behavior for existing code to omit deleted text, but in my mind, the typical use case would only want active text. There is the Tika test, and another test that checks return of deleted text which would need to be adjusted. I would want to determine how HWPF works with deleted text.

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Monday, December 04, 2017 10:43 AM
To: POI Developers List <[hidden email]>
Subject: RE: Test document Tika-792

I'd prefer to avoid ThreadLocal if possible.

Could we add an enum for type of run?  Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available?

In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's getText().

-----Original Message-----
From: Andreas Beeker [mailto:[hidden email]]
Sent: Sunday, December 3, 2017 4:15 PM
To: [hidden email]
Subject: Re: Test document Tika-792

There's also a third (and probably more ...) option:
You could set an option/enum/boolean as a ThreadLocal to decide how to output the data.
The default might be not to return deleted text and Tika could change it when calling the extractor.

As the library would fill up with ThreadLocals, I wonder if we could have a central configuration ThreadLocal (or similar) instance?

Andi

On 12/3/17 1:34 AM, Mark Murphy wrote:

> I am working on Bug 61787 where documents containing rsidDel=000000
> are not extracting the correct text. The issue is that rsidxxx
> attributes are just there to indicate which revision a particular
> change belongs to, but does not necessarily indicate that a particular
> revision actually occurred. Bug
> 58067 corrected an issue where deleted text was being returned by the
> XWPFParagraph.getText() method. Unfortunately the patch for 58067 was
> keying on the rsidDel attribute rather than the delText tag which
> specifically means that this is the deleted text.
>
> So I corrected this in XWPFParagraph and XWPFRun. Now a test on
> document
> Tika-792 is failing because it is expecting getText to return deleted text.
> So what do you want to do?
>
> The options as I see them are
>
>    1. To allow getText() to return all text, even deleted text, than add
>    another method to only return undeleted text.
>    2. Change getText() to return undeleted text, and add another method to
>    retrieve all text.
>
> I prefer the second option, and I suspect that the Tika test is not
> particularly valid as it's comment is that it's purpose is to include
> CTBookmark classes in ooxmlLite. Tim, do you have a preference here as
> my change will likely affect you the most.
>


B KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB  [  X  ܚX KK[XZ[  ] ][  X  ܚX P K \X K ܙ B  ܈Y][ۘ[  [X[  K[XZ[  ] Z[ K \X K ܙ B B B KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB  [  X  ܚX KK[XZ[  ] ][  X  ܚX P K \X K ܙ B  ܈Y][ۘ[  [X[  K[XZ[  ] Z[ K \X K ܙ B B

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]