[Bug 61354] New: Tika fails to get full HTML

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61354

            Bug ID: 61354
           Summary: Tika fails to get full HTML
           Product: POI
           Version: 3.17-dev
          Hardware: PC
            Status: NEW
          Severity: major
          Priority: P2
         Component: XWPF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35184
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35184&action=edit
MultipleBodyBug

Apache Tika fails to get the full HTML if the Word document has multiple
bodies.  We only get the data from the first body.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[Bug 61354] Tika fails to get full HTML

Karthik Ramachandran <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

[Bug 61354] Tika fails to get full HTML

Karthik Ramachandran <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #1 from Karthik Ramachandran <[hidden email]> ---
Created attachment 35185
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35185&action=edit
Patch for reading all body

[Bug 61354] Tika fails to get full HTML

PJ Fanning <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from PJ Fanning <[hidden email]> ---
Merged using https://svn.apache.org/repos/asf/poi/trunk@1803250

[Bug 61354] Tika fails to get full HTML

--- Comment #3 from Tim Allison <[hidden email]> ---
Karthik, Thank you for sharing a patch and triggering document!  PJ, thank you
for fixing this so quickly!

As a side note, Tika's experimental SAX parser for docx does extract
everything; and this is exactly one of the reasons that I added it -- so that
if we don't account for structural rarities, we'll still get the text.  With
our DOM model, we're looking for some specific things in specific places (see
also TIKA-1130).
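
To illustrate the difference described above, here is a minimal sketch that
uses only the JDK's XML APIs. It is not POI or Tika code. The class name
MultiBodyDemo, its method names, and the simplified two-body XML string are
all assumptions made up for this example. The DOM-style method looks only at
the first w:body element, while the SAX-style method collects every w:t text
run wherever it appears.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MultiBodyDemo {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical, heavily simplified stand-in for a document.xml
    // that contains two w:body elements.
    static final String XML =
        "<w:document xmlns:w=\"" + W + "\">"
      + "<w:body><w:p><w:r><w:t>first</w:t></w:r></w:p></w:body>"
      + "<w:body><w:p><w:r><w:t>second</w:t></w:r></w:p></w:body>"
      + "</w:document>";

    // DOM style: look for text in one specific place (the first body only).
    static String firstBodyText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Element body = (Element) doc.getElementsByTagNameNS(W, "body").item(0);
        List<String> texts = new ArrayList<>();
        if (body != null) {
            NodeList runs = body.getElementsByTagNameNS(W, "t");
            for (int i = 0; i < runs.getLength(); i++) {
                texts.add(runs.item(i).getTextContent());
            }
        }
        return String.join(" ", texts);
    }

    // SAX style: stream the whole file and keep every w:t element's text,
    // regardless of which parent element it sits under.
    static String allText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        List<String> texts = new ArrayList<>();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                private StringBuilder current; // non-null while inside w:t

                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    if (W.equals(uri) && "t".equals(local)) {
                        current = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (current != null) {
                        current.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if (W.equals(uri) && "t".equals(local)) {
                        texts.add(current.toString());
                        current = null;
                    }
                }
            });
        return String.join(" ", texts);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstBodyText(XML)); // prints "first"
        System.out.println(allText(XML));       // prints "first second"
    }
}
```

The streaming version never needs to know how many bodies exist, which is why
an approach like it can tolerate unexpected structure, at the cost of knowing
less about where each piece of text came from.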

Make no mistake, we need to fix our DOM parser when people find problems, and
I'm grateful that you opened this!
