[Bug 61470] New: Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

            Bug ID: 61470
           Summary: Text with phonetic runs aren't extracted in docx
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35269
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35269&action=edit
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within
"ruby" sections.  This means that neither the primary text ("東京") nor the
phonetic text ("とうきょう") is extracted.

The more general point is that a run can contain a run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
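
The nesting above can be handled by looking at every `w:t` element and checking whether it sits under a `w:rt` (phonetic) ancestor. The following is a minimal sketch using only the JDK DOM parser, not POI's actual fix; the class name `RubyRunSketch`, the helper names, and the trimmed-down `SAMPLE` fragment are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RubyRunSketch {
    // WordprocessingML main namespace, bound to the "w" prefix in docx files.
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified version of the fragment in the report (rPr elements omitted).
    // Unicode escapes spell the phonetic text and the base text.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W + "\"><w:r><w:ruby>"
        + "<w:rt><w:r><w:t>\u3068\u3046\u304D\u3087\u3046</w:t></w:r></w:rt>"
        + "<w:rubyBase><w:r><w:t>\u6771\u4EAC</w:t></w:r></w:rubyBase>"
        + "</w:ruby></w:r></w:p>";

    /** Returns a two-element array: {base text, phonetic text}. */
    static String[] extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder base = new StringBuilder();
            StringBuilder phonetic = new StringBuilder();
            // getElementsByTagNameNS searches all descendants, so runs nested
            // inside other runs are found as well.
            NodeList texts = doc.getElementsByTagNameNS(W, "t");
            for (int i = 0; i < texts.getLength(); i++) {
                Node t = texts.item(i);
                if (hasAncestor(t, "rt")) {
                    phonetic.append(t.getTextContent());
                } else {
                    base.append(t.getTextContent());
                }
            }
            return new String[] { base.toString(), phonetic.toString() };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    /** True if any ancestor of n is a w:localName element. */
    static boolean hasAncestor(Node n, String localName) {
        for (Node p = n.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE
                    && W.equals(p.getNamespaceURI())
                    && localName.equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] parts = extract(SAMPLE);
        System.out.println("base=" + parts[0] + " phonetic=" + parts[1]);
    }
}
```

Searching all descendants rather than only direct children of the paragraph is what picks up the inner runs in this sketch.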

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
                 OS|                            |All

--- Comment #1 from Tim Allison <[hidden email]> ---
r1806712

Added extraction of runs within ruby elements; added ability for users to
select whether or not to concatenate phonetic runs; set default toString()
behavior to include phonetic runs.
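
To illustrate what a user-selectable switch for phonetic runs might look like, here is a small stand-alone sketch. It does not reproduce the code committed in r1806712; the `Run` type, the `render` method, and the choice to put the phonetic text in parentheses after the base text are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PhoneticToggleSketch {
    /** Hypothetical run model: base text plus optional phonetic (ruby) text. */
    static final class Run {
        final String text;
        final String phonetic; // empty when the run has no ruby annotation

        Run(String text, String phonetic) {
            this.text = text;
            this.phonetic = phonetic;
        }
    }

    /**
     * Joins the runs into one string. When concatenatePhonetic is true, each
     * non-empty phonetic string is appended in parentheses after its base text.
     */
    static String render(List<Run> runs, boolean concatenatePhonetic) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
            if (concatenatePhonetic && !r.phonetic.isEmpty()) {
                sb.append(" (").append(r.phonetic).append(')');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Tokyo: ", ""));
        runs.add(new Run("\u6771\u4EAC", "\u3068\u3046\u304D\u3087\u3046"));
        System.out.println(render(runs, true));
        System.out.println(render(runs, false));
    }
}
```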

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

studio test <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |major
                 OS|All                         |Windows 7

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #2 from studio test <[hidden email]> ---
Created attachment 35271
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35271&action=edit
Test script

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #3 from Tim Allison <[hidden email]> ---
Spam?

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #4 from Tim Allison <[hidden email]> ---
Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the
document and then dump it at the end.  This would allow a document to be
found via the phonetic info, and it wouldn't completely wreck NLP applications.

For another issue...
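
The caching idea above could look roughly like the sketch below: the body text is emitted without interruption, and the phonetic strings are collected and appended once at the end. This is only an illustration of the proposal, not existing POI behavior; the class name, the `String[]{text, phonetic}` run representation, and the newline/space separators are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredPhoneticSketch {
    /**
     * Each run is a two-element array {text, phonetic}; phonetic is empty
     * when the run carries no ruby annotation (hypothetical representation).
     */
    static String render(String[][] runs) {
        StringBuilder body = new StringBuilder();
        List<String> cached = new ArrayList<>();
        for (String[] run : runs) {
            body.append(run[0]);           // base text goes out in reading order
            if (!run[1].isEmpty()) {
                cached.add(run[1]);        // phonetic text is held back
            }
        }
        if (!cached.isEmpty()) {
            // Dump the cached phonetic strings after the body, on their own line.
            body.append('\n').append(String.join(" ", cached));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String[][] runs = {
            { "\u6771\u4EAC", "\u3068\u3046\u304D\u3087\u3046" },
            { " is a city.", "" },
        };
        System.out.println(render(runs));
    }
}
```

With this layout the running text stays contiguous for downstream tokenizers, while the phonetic strings remain searchable in the extracted output.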

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #5 from Tim Allison <[hidden email]> ---
Created attachment 35275
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35275&action=edit
reason to cache...for posterity and a later issue

[Bug 61470] Text with phonetic runs aren't extracted in docx

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #35275|0                           |1
        is obsolete|                            |

--- Comment #6 from Tim Allison <[hidden email]> ---
Created attachment 35276
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35276&action=edit
reason to cache for posterity and a later issue

correct file attached this time
