Proposal: make HSSF/XSSF/SXSSF behavior consistent

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Proposal: make HSSF/XSSF/SXSSF behavior consistent

Vladislav Galas
Correct me if I'm wrong: all three said implementations should behave identically in the common field (say, SXSSF doen't support array functions for understandable reasons) as long as SpreadsheetVersion constraints aren't violated.

However, I was trying to improved test coverage for *Cell class and found out that at times the implementations behave very differently. Understandably, this comes from three reasons:
a) written by different people at different times
b) lack of specifications on the interface
c) bloated interfaces.

So I could take the job of unifying the behavior. I forsee the objections like "we shouldn't break what users are accustomed to". Fair enough, but mainly for the "user-observable" part. I.e. values are set/got ok, formulae are evaluated. But then there are those differences in the methods users (I guess) don't actually use.

Say, we have a formula with a string cached value and a formula:
Cell cell = ...;
cell.setCellFormula("\"foo\");
cell.setCellValue(1);
ok, now
getCellType() == CellType.FORMULA,
getCachedFormulaResultType() == CellType.NUMERIC, and
getNumericCellValue() == 1.0.

So far so good. But if we say
cell.setCellType(CellType.NUMERIC), what will happen? Spoiler: anything can happen.
For HSSF and SXSSF it will be FORMULA (why??)
For XSSF it will be BLANK (why??)

Such discrepancies may happen whenever implicit value/type casting is performed, i.e. for setCellValue, get*CellValue, setCellType and setCellFormula.

So,
1. I can prepare a set of contracts for such functions and present them for discussion here. The passed through ones will be clearly stated in the Cell interface, tests updated (1 clause in the javadoc = 1 test), implementations fixed.

2. A group of stronger and more invasive specific proposals:
  a) mark setCellType @Deprecated. It's not something you can do in Excel, nor the implicit value conversion logic is reliable. setCellType(FORMULA) is meaningless. On the contrary, setBlank() should be a part of the public interface because it has a clear meaning: delete both value and formula, and preserve the cell with its style.
  b) make get*Value return the value only if the value type strictly corresponds to the method name. The evaluator performs conversions on its own, and I very much doubt that users rely on a conversion like "get the numeric value of the error code".
  c) at last, remove FORMULA from CellType and deprecate getCachedFormulaResultType. Why should anyone be concerned with the formula while reading a value? get*Value work identically, be formula set or not. I suspect there may be a reason for current state, though, because HSSF and XSSF implementations are strongly tied to the underlying xml beans. But there's still a good chance this removal is possible.
  d) probably, CellType._NONE can also be removed
  e) it also may be a good idea to remove BLANK as well (simultaneously adding isBlank() method), because it's "not a value".

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Vladislav Galas
Actually value/type conversion logic could (and should, imho) be moved to a superclass.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Greg Woolsey
Good thoughts - I'm glad you are on board and thinking about the
hard-to-track-down issues when implementations of interfaces differ in
their behavior.  I've not used the different hierarchies much, sticking
mostly to XSSF, but I can see how this could lead to tricky bugs.

I'm personally partial to default methods in interfaces now that we are up
to Java 8 for base support. That can help when the methods are well-defined
and don't need much instance state in cases where we want/need further
class distinctions and hierarchy on one branch of the implementations but
not others.  Moving to a common superclass locks in things that maybe we
want left more flexible.  I don't have an example off the top for POI, but
I've seen that other places.

However, if there is enough common behavior that it requires significant
instance state to perform, default methods become harder to maintain and a
superclass starts to look more appealing.

Looking forward to your suggestions and discussion topics.

Greg

On Fri, Jan 4, 2019 at 1:41 PM Vladislav Galas <[hidden email]> wrote:

> Actually value/type conversion logic could (and should, imho) be moved to
> a superclass.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Vladislav Galas
1. setCellType() should be deprecated. Any value/format conversions should be performed outside of the cell through its public interface.
    While it's still available, we'll have to stick with conversions. In the next list, for any previous type not specified explicitly, the default value for the new type will be set.
    1.0. In all cases except BLANK, (ignore formula and handle only values) OR (remove formula)? I vote for ignoring the formula. We operate on values here, not formula.
    1.1. _NONE: throw IllegalArgumentException
    1.2. BLANK: as is. Remove formula and value, keep style, comments etc. Should evolve into a setBlank() method someday.
    1.3. BOOLEAN. Default value: false.
        1.3.1 Strings: equalsIgnoreCase("true") ? true : false
        1.3.2. Numbers: getNumericValue() != 0
    1.4. NUMERIC. Default value: 0.
        1.4.1. Strings: Double.parseString, default value if fails.
        1.4.2. Bools: getBooleanValue ? 1 : 0.
    1.5. STRING. Default value: "" (empty string).
        1.5.1. Bools: getBooleanValue ? "TRUE" : "FALSE" (defined as static constants, at least in XSSF).
        1.5.2. Numbers: Double.toString(getNumericValue()).
    1.5. FORMULA. Throw IllegalArgumentException. Setting cell type to formula makes no sense.
2. setCellFormula()
    2.1. setCellFormula(!null). In our reference, Excel, it's seemingly impossible to set a formula without Excel immediately evaluating it, even in manual calculation mode.
        I tried to tinker with unpacked XML. It's possible to set value type in a formula cell to another type, and it's possible to remove the value (in this case, Excel immediately converts it to 0 on load).
        Therefore, I vote for keeping old value, if any. If the cell was blank, set value to 0. State when formula is set and value is not should be illegal.
    2.2. setCellFormula(null) shall keep the value by all means, that's what Excel does.
        A method removeFormula() should be added to Cell. Calling setCellFormula with null shall be discouraged by throwing an IllegalArgumentException telling user to call removeFormula().
3. getValueType() should be added as a transition. Implementation: return getCellType() == FORMULA ? getCachedFormulaResultType() : getCellType(). Possible values will be BLANK, BOOLEAN, NUMERIC, STRING, ERROR.
4. get*Value should throw IllegalStateException if stored value type doesn't match the method.
5. Usage of {byte getErrorCellValue()} and setErrorCellValue(byte) should be discouraged in favor of using FormulaError.
    Note: writing a cell with a CIRCULAR_REF to xlsx produces a corrupt file. I don't have a good soultion for this. Scanning all cells for a circular ref and handling it somehow is clumsy and costly.
6. setCellValue() should ignore the formula, if any, and simply write the value, changing the underlying storage, if necessary. Previous value shall be discarded.

That's it... Hope it helps us get a skinny yet complete interface and consistent implementations.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Greg Woolsey
I like this, but wonder about the change to throwing runtime exceptions
(IllegalArgumentException specifically) in new cases, especially where we
are deprecating a given _input_ but not an entire function, like
setCellFormula(null).  That seems like it will break a lot of downstream
code, and cause a bunch of bug reports and mailing list traffic from folks
who haven't read the release notes.  It may be worth it, but wanted to
bring it up as a consideration.

Of course this would need to be in a 4.1.0 version, not a 4.0.2 due to the
breaking changes to API behavior (even though we could call most of this
bug fixes).

On Fri, Jan 4, 2019 at 3:15 PM Vladislav Galas <[hidden email]> wrote:

> 1. setCellType() should be deprecated. Any value/format conversions should
> be performed outside of the cell through its public interface.
>     While it's still available, we'll have to stick with conversions. In
> the next list, for any previous type not specified explicitly, the default
> value for the new type will be set.
>     1.0. In all cases except BLANK, (ignore formula and handle only
> values) OR (remove formula)? I vote for ignoring the formula. We operate on
> values here, not formula.
>     1.1. _NONE: throw IllegalArgumentException
>     1.2. BLANK: as is. Remove formula and value, keep style, comments etc.
> Should evolve into a setBlank() method someday.
>     1.3. BOOLEAN. Default value: false.
>         1.3.1 Strings: equalsIgnoreCase("true") ? true : false
>         1.3.2. Numbers: getNumericValue() != 0
>     1.4. NUMERIC. Default value: 0.
>         1.4.1. Strings: Double.parseString, default value if fails.
>         1.4.2. Bools: getBooleanValue ? 1 : 0.
>     1.5. STRING. Default value: "" (empty string).
>         1.5.1. Bools: getBooleanValue ? "TRUE" : "FALSE" (defined as
> static constants, at least in XSSF).
>         1.5.2. Numbers: Double.toString(getNumericValue()).
>     1.5. FORMULA. Throw IllegalArgumentException. Setting cell type to
> formula makes no sense.
> 2. setCellFormula()
>     2.1. setCellFormula(!null). In our reference, Excel, it's seemingly
> impossible to set a formula without Excel immediately evaluating it, even
> in manual calculation mode.
>         I tried to tinker with unpacked XML. It's possible to set value
> type in a formula cell to another type, and it's possible to remove the
> value (in this case, Excel immediately converts it to 0 on load).
>         Therefore, I vote for keeping old value, if any. If the cell was
> blank, set value to 0. State when formula is set and value is not should be
> illegal.
>     2.2. setCellFormula(null) shall keep the value by all means, that's
> what Excel does.
>         A method removeFormula() should be added to Cell. Calling
> setCellFormula with null shall be discouraged by throwing an
> IllegalArgumentException telling user to call removeFormula().
> 3. getValueType() should be added as a transition. Implementation: return
> getCellType() == FORMULA ? getCachedFormulaResultType() : getCellType().
> Possible values will be BLANK, BOOLEAN, NUMERIC, STRING, ERROR.
> 4. get*Value should throw IllegalStateException if stored value type
> doesn't match the method.
> 5. Usage of {byte getErrorCellValue()} and setErrorCellValue(byte) should
> be discouraged in favor of using FormulaError.
>     Note: writing a cell with a CIRCULAR_REF to xlsx produces a corrupt
> file. I don't have a good soultion for this. Scanning all cells for a
> circular ref and handling it somehow is clumsy and costly.
> 6. setCellValue() should ignore the formula, if any, and simply write the
> value, changing the underlying storage, if necessary. Previous value shall
> be discarded.
>
> That's it... Hope it helps us get a skinny yet complete interface and
> consistent implementations.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Vladislav Galas
Sure, I understand that we should be careful.
W.r.t. to throwing, the first step could be stating in javadoc that passing null to setCellFormula shold be minimized, and removeFormula() should be used instead.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Proposal: make HSSF/XSSF/SXSSF behavior consistent

Yegor Kozlov-4
In reply to this post by Vladislav Galas
 Unification of cell type convensions makes sense. Changing the logic to
manipulate cell type/value convenrsions via public methods is fine as long
as it does not break compatibility.

setCellType() was introduced in HSSFCell in POI 1.0 which was very
format-centric. It wasn't designed to mimic Excel's object model, it was
rather a usermodel that reflected our knowledge about the XLS Biff8 format
at that time.

Then XSSF came and implemenetd its own setCellType() semantics and I'm not
surprised that there is a divergence.

Regards,
Yegor

On Sat, Jan 5, 2019 at 12:17 AM Vladislav Galas <[hidden email]> wrote:

> Correct me if I'm wrong: all three said implementations should behave
> identically in the common field (say, SXSSF doen't support array functions
> for understandable reasons) as long as SpreadsheetVersion constraints
> aren't violated.
>
> However, I was trying to improved test coverage for *Cell class and found
> out that at times the implementations behave very differently.
> Understandably, this comes from three reasons:
> a) written by different people at different times
> b) lack of specifications on the interface
> c) bloated interfaces.
>
> So I could take the job of unifying the behavior. I forsee the objections
> like "we shouldn't break what users are accustomed to". Fair enough, but
> mainly for the "user-observable" part. I.e. values are set/got ok, formulae
> are evaluated. But then there are those differences in the methods users (I
> guess) don't actually use.
>
> Say, we have a formula with a string cached value and a formula:
> Cell cell = ...;
> cell.setCellFormula("\"foo\");
> cell.setCellValue(1);
> ok, now
> getCellType() == CellType.FORMULA,
> getCachedFormulaResultType() == CellType.NUMERIC, and
> getNumericCellValue() == 1.0.
>
> So far so good. But if we say
> cell.setCellType(CellType.NUMERIC), what will happen? Spoiler: anything
> can happen.
> For HSSF and SXSSF it will be FORMULA (why??)
> For XSSF it will be BLANK (why??)
>
> Such discrepancies may happen whenever implicit value/type casting is
> performed, i.e. for setCellValue, get*CellValue, setCellType and
> setCellFormula.
>
> So,
> 1. I can prepare a set of contracts for such functions and present them
> for discussion here. The passed through ones will be clearly stated in the
> Cell interface, tests updated (1 clause in the javadoc = 1 test),
> implementations fixed.
>
> 2. A group of stronger and more invasive specific proposals:
>   a) mark setCellType @Deprecated. It's not something you can do in Excel,
> nor the implicit value conversion logic is reliable. setCellType(FORMULA)
> is meaningless. On the contrary, setBlank() should be a part of the public
> interface because it has a clear meaning: delete both value and formula,
> and preserve the cell with its style.
>   b) make get*Value return the value only if the value type strictly
> corresponds to the method name. The evaluator performs conversions on its
> own, and I very much doubt that users rely on a conversion like "get the
> numeric value of the error code".
>   c) at last, remove FORMULA from CellType and deprecate
> getCachedFormulaResultType. Why should anyone be concerned with the formula
> while reading a value? get*Value work identically, be formula set or not. I
> suspect there may be a reason for current state, though, because HSSF and
> XSSF implementations are strongly tied to the underlying xml beans. But
> there's still a good chance this removal is possible.
>   d) probably, CellType._NONE can also be removed
>   e) it also may be a good idea to remove BLANK as well (simultaneously
> adding isBlank() method), because it's "not a value".
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>