[Bug 64418] New: Finding text in textfields is very slow

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

            Bug ID: 64418
           Summary: Finding text in textfields is very slow
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

I am scanning docx documents for occurrences of specific words / search terms.

The code I am using is shown below.

The search terms can literally be anywhere: in header, footer, paragraphs,
tables, text fields, ...

Even for a complex document, parsing takes only a few seconds as long as it
uses no or very few text fields. As soon as multiple text fields are involved,
parsing takes a considerable amount of time, e.g. 30 seconds or even more than
a minute.


Is there anything I am doing wrong in how I use the API, or is there an issue
with XWPF?

Thanks,
Jens



    private static void findInBodyElements(String key, List<IBodyElement> bodyElements,
            ArrayList<String> resultList) {
        if (resultList.contains(key)) {
            return;
        }

        for (IBodyElement bodyElement : bodyElements) {
            if (bodyElement.getElementType().compareTo(BodyElementType.PARAGRAPH) == 0) {
                findInParagraph(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }
                findInTextfield(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }

            }
            if (bodyElement.getElementType().compareTo(BodyElementType.TABLE) == 0) {
                findInTable(key, (XWPFTable) bodyElement, resultList);

            }
        }
    }

    private static void findInParagraph(String key, XWPFParagraph xwpfParagraph,
            ArrayList<String> resultList) {

        if (resultList.contains(key)) {
            return;
        }

        //for (XWPFParagraph paragraph : xwpfParagraphs) {
        List<XWPFRun> runs = xwpfParagraph.getRuns();

        String find = key;
        TextSegment found = xwpfParagraph.searchText(find, new PositionInParagraph());
        if (found != null) {
            if (!resultList.contains(key)) {
                resultList.add(key);
                return;
            }
        }

    }

    private static void findInTextfield(String key, XWPFParagraph xwpfParagraph,
            ArrayList<String> resultList) {

        if (resultList.contains(key)) {
            return;
        }

        XmlCursor cursor = xwpfParagraph.getCTP().newCursor();
        cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p/w:r");

        List<XmlObject> ctrsintxtbx = new ArrayList<XmlObject>();

        while (cursor.hasNextSelection()) {
            cursor.toNextSelection();
            XmlObject obj = cursor.getObject();
            ctrsintxtbx.add(obj);
        }
        for (XmlObject obj : ctrsintxtbx) {
            try {
                CTR ctr = CTR.Factory.parse(obj.xmlText());
                XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
                String text = bufferrun.getText(0);
                if (text != null && text.contains(key)) {
                    if (!resultList.contains(key)) {
                        resultList.add(key);
                        return;
                    }
                }
            } catch (Exception ex) {
                log.error("Unable to iterate text fields", ex);
            }
        }

    }

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Dominik Stadler <[hidden email]> ---
Can you provide a sample file which shows the slowdown? Would make it much
easier to try to analyze/reproduce it.

j-lawyer.org <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #2 from j-lawyer.org <[hidden email]> ---
Thank you Dominik for the reply.

I just created a fully runnable example:
https://www.j-lawyer.org/temp/DocXShowCase.zip

It is a NetBeans project that includes a runnable test case as well as example
documents. Both docx documents are comparable in complexity; one has no text
fields, the other one has 10 text fields.

When running the code, these are the timings I get:

without textfields, search: 676
with textfields, search: 15678

So when text fields are involved, the search takes roughly 23 times as long.

Let me know if I can provide anything else and I will be on top of it in no
time.

Thanks!
Jens / j-lawyer.org

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #3 from Dominik Stadler <[hidden email]> ---
Thanks, but unfortunately there is lots of code which is not related to the
problem and thus makes reproducing and analyzing this very hard. The app seems
to not finish for a very long time for me. It also looks a bit like you are
iterating over the contents of the document many times with all the
placeholders and some of the loops in your application.

Can you reduce the code in the sample project as much as possible so that it
still shows the problem, but does not do all the things that are only needed
for your application?

--- Comment #4 from j-lawyer.org <[hidden email]> ---
Thanks Dominik for looking into this. I have stripped down the test case; the
URL is still the same: https://www.j-lawyer.org/temp/DocXShowCase.zip

- has a list of 50 strings to be searched for in documents
- has two documents, both just 1 page - (a) has no text fields and (b) has 10
text fields
- each of the 50 strings is searched for using a loop, so I am iterating over
each document fifty times

Basically I just want to know which of the 50 strings are contained in the
documents.

Thanks,
Jens
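
Since the goal is only to learn which of the 50 strings occur, one option is to
collect the document's text once and test every term against it, rather than
walking the document once per term. The following is a minimal sketch of that
idea in plain Java, not code from the sample project. The class `TermScan`,
the method `findTerms` and the placeholder-style sample strings are
hypothetical, and the text fragments are assumed to have been extracted from
the document already (for example one string per paragraph or run).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TermScan {

    /**
     * Returns the terms that occur in at least one fragment.
     * Every fragment is visited once; a term is dropped from the
     * to-do list as soon as it has been found.
     */
    public static Set<String> findTerms(List<String> fragments, List<String> terms) {
        Set<String> found = new LinkedHashSet<>();
        List<String> remaining = new ArrayList<>(terms);

        for (String fragment : fragments) {
            if (remaining.isEmpty()) {
                break; // all terms located, no need to look at more text
            }
            Iterator<String> it = remaining.iterator();
            while (it.hasNext()) {
                String term = it.next();
                if (fragment.contains(term)) {
                    found.add(term);
                    it.remove();
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for extracted document text.
        List<String> fragments = List.of("Dear {{NAME}},", "your file {{FILE_NO}} is ready.");
        List<String> terms = List.of("{{NAME}}", "{{FILE_NO}}", "{{DATE}}");
        System.out.println(findTerms(fragments, terms));
    }
}
```

As with the per-run check in the code from the original report, a term that is
split across two fragments would not be found this way.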

j-lawyer.org <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

Dominik Stadler <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Dominik Stadler <[hidden email]> ---
The following line is taking most of the CPU by far, so you likely need to
rework your code to not have to produce XML and then parse it in again
afterwards.

CTR.Factory.parse(obj.xmlText())

j-lawyer.org <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from j-lawyer.org <[hidden email]> ---
Well, I would love to get rid of the expensive XML handling - however, I do not
see how I could avoid it given POI's API.

Is there an alternative approach for "getting all text content of text fields /
text boxes"?

Even Apache Tika seems to use the exact same approach in their
XWPFWordExtractorDecorator.java:

        // Also extract any paragraphs embedded in text boxes
        //Note "w:txbxContent//"...must look for all descendant paragraphs
        //not just the immediate children of txbxContent -- TIKA-2807
        if (config.getIncludeShapeBasedContent()) {
            for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent//w:p")) {
                extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()),
                        paragraph.getBody()), listManager, xhtml);
            }
        }


Am I missing something?

Thanks,
Jens

--- Comment #7 from PJ Fanning <[hidden email]> ---
Instead of `CTP.Factory.parse(embeddedParagraph.xmlText())`, could you try
`CTP.Factory.parse(embeddedParagraph.getDomNode())`?

This might lower the overhead of the parse call.
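
If only the text is needed for searching, another possibility is to read it
straight from the DOM node and skip the parse call altogether. The sketch below
uses only the JDK's DOM API and is an illustration, not POI or Tika code. The
class `TextBoxText` and its methods are hypothetical, the sample XML is a
simplified stand-in for a real paragraph (real documents nest `w:txbxContent`
more deeply), and it assumes that the node returned by `getDomNode()` behaves
like a standard namespace-aware DOM element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TextBoxText {

    public static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Collects the text of every w:t element found below a w:txbxContent
     * element under the given node. Text outside text boxes is ignored.
     * Nested text boxes would be reported more than once, which is harmless
     * for a "does it contain the term" check.
     * Assumption: the node is an Element or a Document.
     */
    public static List<String> textBoxTexts(Node node) {
        Element root = (node instanceof Document)
                ? ((Document) node).getDocumentElement()
                : (Element) node;
        List<String> texts = new ArrayList<>();
        NodeList boxes = root.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            Element box = (Element) boxes.item(i);
            NodeList textNodes = box.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < textNodes.getLength(); j++) {
                texts.add(textNodes.item(j).getTextContent());
            }
        }
        return texts;
    }

    /** Helper for the demo only: builds a namespace-aware DOM from a string. */
    public static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Sample XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Simplified sample paragraph: one normal run and one run with a text box.
        String xml = "<w:p xmlns:w='" + W_NS + "'>"
                + "<w:r><w:t>body text</w:t></w:r>"
                + "<w:r><w:pict><w:txbxContent><w:p><w:r><w:t>inside box</w:t></w:r></w:p>"
                + "</w:txbxContent></w:pict></w:r></w:p>";
        System.out.println(textBoxTexts(parse(xml)));
    }
}
```

In the search code from the report, the node would come from something like
`xwpfParagraph.getCTP().getDomNode()` instead of the `parse` helper; whether
this is actually faster than the parse-based approach would need to be
measured against the sample documents.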
