[Bug 57008] Wrting _x0427_ to a string cell changes the string to some strange UTF-8 character

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[Bug 57008] Wrting _x0427_ to a string cell changes the string to some strange UTF-8 character

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=57008

--- Comment #14 from Mark Murphy <[hidden email]> ---
(In reply to Matthias Gerth from comment #13)

> I've written an escape function as a workaround.
> https://stackoverflow.com/questions/48222502/xssfcell-in-apache-poi-encodes-
> certain-character-sequences-as-unicode-character
>
> So my use case is this: I need to store a string containing "_x24B8_" into
> an excel file. This is user input and I can not prevent this. The setValue
> function on XSSFCell has one parameter of type java.lang.String. Java string
> does not use microsoft encoding to represent unicode character.
>
> So this happens
> 1. String value = "_x24B8_";
> 2. String valueEscaped = escape(value); // "_x005F_x24B8_"
> 3. cell.setValue(valueEscaped) // cell.value is now "_x24B8_"
> 4. once the file is written is changes back to "_x005F_x24B8_" in the file
>
> I think setValue should not call XSSFRichTextString.utfDecode(). This would
> prevent this back and forth encoding.
> We could also make XSSFRichTextString.utfDecode() a public for people who
> are using this type of encoding. I would prefer this microsoft encoding
> terminated within the library since it is specific to the office file format.

In my mind, this is how it should work: cell.setValue() should perform a
utfEscape() (this does not currently exist) which will add "_x005F" to anything
that looks like "_x####_" but not to things that look like "_x005F_x####_".
conversely cell.getRichTextString() should perform a utfUnescape() (this does
not exist) which will strip the "_x005F" from anything that looks like
"_x005F_x####_". But we need to make sure that when reading a file, or
processing XML, these Unicode special characters remain unchanged. Section
3.18.96 in Office Open XML 1st Edition Part 4 indicates that Unicode characters
that cannot be in an XML document need to be encoded as "_x####_". I was unable
to find out just which characters those were, so I don't know about encoding
and decoding these characters. Right now it looks like we just decode "_x####_"
to the Unicode character on read.

So If I use your example above:
1. cell.setValue("_x24B8_"); // resulting XML should contain "_x005F_x24B8_"
2. cell.getRichTextString(); // should return "_x24B8_"

For a file read in that contains "_x24B8_" in a cell:
3. cell.getRechTextString(): // should return "Ⓒ" - utfDecode() does this
correctly.

For a file read in that contains "_x005F_x24B8_" in a cell:
3. cell.getRechTextString(): // should return "_x24B8_"

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]