[Bug 61841] New: Unnecessary long computation when evaluating VLOOKUP on all column reference

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug 61841] New: Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

            Bug ID: 61841
           Summary: Unnecessary long computation when evaluating VLOOKUP
                    on all column reference
           Product: POI
           Version: 3.15-FINAL
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SS Common
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Created attachment 35573
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35573&action=edit
Simple testcase

Hi all,
I spotted a problem in the evaluation of VLOOKUP function when the table area
is defined as all column reference.

Basically, when a "all column" reference is used, the evaluation code treats
the reference as if it's applied to the maximum number of rows available for
the workbook version.

In practice every reference as a $C:$D is translated as a $C$1:$D$1048576.
When such references are used a table argument for a VLOOKUP (exact match) and
the value to be searched is not found, the evaluation code loops for 2^20
times.

A proposed fix could be to translate the "all column" reference to a reference
that goes from row 1 to the maximum row of the considered sheet.

I attached a simple xlsx workbook that takes a *very long* time to evaluate
about 300 VLOOKUPs.

A fix could be to enhance the logic of
OperationEvaluationContext.getDynamicReference to translate the reference to a
narrow area.
I am available for any clarification.

Best regards,
    Luca

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

Luca Martini <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]
                   |                            |m, [hidden email],
                   |                            |[hidden email]
                 OS|                            |All

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

Nick Burch <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Nick Burch <[hidden email]> ---
Would you be able to put together a patch with the improved logic in?

See http://poi.apache.org/guidelines.html and
http://poi.apache.org/howtobuild.html and http://poi.apache.org/subversion.html
for more on building and contributing!

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #2 from Luca Martini <[hidden email]> ---
Nick,
I have already contributed some patches in the past.
However I am not sure to have the time to work to this fix at the moment.
I opened the bug for future reference and to check if someone with more
knowledge about evaluators could help.
In fact, at a first glance, the current workbook is not easy recoverable from
the WorkbookEvaluator without changing its interface and I am not sure that is
safe and sound to do so.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #3 from Javen O'Neal <[hidden email]> ---
Looking at the POI implementation of Vlookup.java, the class currently doesn't
have a reference to the sheet that contains the lookup table. This will make a
fix non-trivial.

The function that creates the AreaEval that is passed to the evaluate function
might have a reference to the lookup sheet.
Caveats: lookup table may be contained in a different workbook or sheet than
the cell that contains the formula that is being evaluated.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #4 from Greg Woolsey <[hidden email]> ---
It looks to me like the sheet is available:

Vlookup.evalutate() receives a ValueEval that is converted to a TwoDEval via
LookupUtils.resolveTableArrayArg().

TwoDEval interface has only one base class implementing it, AreaEvalBase.

AreaEvalBase abstract class has 2 non-test implementations, cached and lazy.
In both cases actual cell ValueEval references are available, which means
sheet/row/column eventually.  Seems to me we could augment the TwoDEval
interface to indicate the last physical values for rows and columns along with
the existing getWidth() and getHeight() methods.  These could be used when
creating ValueVector objects for the lookup iterations to reduce the vector
size to the maximum defined values, since no matches would exist outside those
anyway.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #5 from Greg Woolsey <[hidden email]> ---
Changes in r1817252

Interesting.  In a local test with the attached sample file, I found these
results:

45s (second run)
with current codebase issuing FormulaEvaluator.evaluateAll() on the workbook.

19s
By just changing XSSFEvaluationSheet.getCell(row, col) to immediately return
null if the row index > sheet.getLastRowNum()

14.4s
when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes
from a TreeMap.lastKey() which has to navigate the tree each time to find the
last key.

12.1s
after optimizing the blank cell tracking a bit to know about the last row with
data.

That's all without changing anything int he VLOOKUP evaluation and still
iterating over the max # of rows per column.

Of this remaining time, about 2/3 is taken up in the formula evaluation caching
and tracking mechanism.  Bypassing it for null cells causes test failures,
which shows it is necessary, but relatively expensive.  It appears to try to
optimize and minimize the "empty cell" rectangular regions it holds. but
assumes processing by row then column.  That may be a memory/time optimization
we want to consider allowing additional strategies for.

Note that this shortcut logic doesn't change the result of any methods, only
avoids busywork that didn't apply to the "nonexistent cell" cases.

This doesn't optimize VLOOKUP directly, but is about 70% improvement
sufficient?

Changing the VLOOKUP code itself is actually significantly more complex,
because POI handles sheets by row internally, and columns are second-class
constructs.  There is no easy way to determine the last row with data in a
column other than iterating over all defined rows.  With these optimizations,
the extra iterations should fail fast.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #6 from Javen O'Neal <[hidden email]> ---
(In reply to Greg Woolsey from comment #5)
> 14.4s
> when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes
> from a TreeMap.lastKey() which has to navigate the tree each time to find
> the last key.

Side-note: should XSSFSheet cache the last row rather than iterating over the
TreeMap for every call? Hopefully we could drop in a better Sorted map
implementation, one that keeps a pointer to the first and last keys.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #7 from PJ Fanning <[hidden email]> ---
Can we call the new method getLastRowNum instead of getlastRowNum?
Also, a lot of public methods have had their signatures changed. Do we need to
keep the existing signatures too and deprecate the old versions of the methods?

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #8 from Luca Martini <[hidden email]> ---
(In reply to Greg Woolsey from comment #5)
> Changes in r1817252
>

Thank you very much Greg.


> Of this remaining time, about 2/3 is taken up in the formula evaluation
> caching and tracking mechanism.  Bypassing it for null cells causes test
> failures, which shows it is necessary, but relatively expensive.  It appears
> to try to optimize and minimize the "empty cell" rectangular regions it
> holds. but assumes processing by row then column.  That may be a memory/time
> optimization we want to consider allowing additional strategies for.
>
> Note that this shortcut logic doesn't change the result of any methods, only
> avoids busywork that didn't apply to the "nonexistent cell" cases.
>
> This doesn't optimize VLOOKUP directly, but is about 70% improvement
> sufficient?

I think so. That's more or less the fix I had in mind.

>
> Changing the VLOOKUP code itself is actually significantly more complex,
> because POI handles sheets by row internally, and columns are second-class
> constructs.  There is no easy way to determine the last row with data in a
> column other than iterating over all defined rows.  With these
> optimizations, the extra iterations should fail fast.

I know, and I think that with current state of the data structure, iterating
over every defined row could be worse than your current solution.
Here we are still on POI 3.x, but it should not be difficult to integrate your
changes in our forked version.

For me the bug is considered as resolved. I still do not change its status
because others have still pending comments.

Best regards,
    Luca

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #9 from Greg Woolsey <[hidden email]> ---
(In reply to PJ Fanning from comment #7)
> Can we call the new method getLastRowNum instead of getlastRowNum?
> Also, a lot of public methods have had their signatures changed. Do we need
> to keep the existing signatures too and deprecate the old versions of the
> methods?

Argh, hate it when I miss things like that.  Yes, fixed it in r1817325.

The "public" methods here are either in classes marked in docs as "POI
internal" or called only from methods similarly documented, so I think changing
their signatures is fine.  I'm not aware of anyone trying to hook into formula
evaluation down at the internal WorkbookEvaluator level.

I'm not opposed to duplicating and deprecating, but in this case I don't think
it's needed.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

Greg Woolsey <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
         Resolution|---                         |FIXED

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference

Bugzilla from bugzilla@apache.org
In reply to this post by Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841

--- Comment #10 from Javen O'Neal <[hidden email]> ---
Throw in some @Override's for good measure

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]