[Bug 63813] New: Special character (greater than equal) converts to '(' text in word documents

Bugzilla from bugzilla@apache.org
https://bz.apache.org/bugzilla/show_bug.cgi?id=63813

            Bug ID: 63813
           Summary: Special character (greater than equal) converts to '('
                    text in word documents
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

Version:

POI 4.1.0

I have documents (either 'doc' or 'docx') that contain a special character for
'greater than or equal'. When I convert them with the code below, which uses
'WordToHtmlConverter', those characters are converted into '('.

I tried this with the latest Apache POI release, 4.1.0.


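My Java code is:


import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

    public static void main(String[] args) {
        // args[0] is the path of the Word document to convert
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);

            // The converter writes its output into an empty DOM document
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());

            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);

            // Serialize the DOM as indented UTF-8 HTML
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            out.close();

            // Decode with the same charset the serializer wrote
            String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}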

Is there any way I can correctly identify these symbols?


In the sample document, the symbol I am interested in getting is the one in
front of 'bad one'.


Thanks

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


--- Comment #1 from [hidden email] ---
Created attachment 36814
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=36814&action=edit
symbl test example doc document


Axel Howind <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #2 from Axel Howind <[hidden email]> ---
Since I want to learn about the non-Excel formats in POI, I am trying to find
out what's going on here. Three things so far:

 - I can confirm that the first one is rendered as '>=' (as a single character)
in Word (at least on a Mac)
 - the program produces the wrong output
 - as far as I can tell, the error has nothing to do with the converter because
I can see the '(' showing up in the debugger when inspecting the `wordDocument`
variable before the converter is even initialised.
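
A check like the one described in the list above can be made without a debugger by listing the code points of whatever text has been read. The following is only a sketch in plain Java: the class name `CodePointDump` and the two stand-in strings are assumptions for illustration, and in practice the text would come from the loaded document.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {

    /** Returns one "U+XXXX NAME" entry per code point in the given text. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Stand-in strings (assumed); replace with text taken from the document.
        describe("\u2265 good one").forEach(System.out::println);
        describe("( bad one").forEach(System.out::println);
    }
}
```

If the first entry for a line is already U+0028, the substitution happened before any HTML conversion took place.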

I will see if I can find out what's wrong, but no promises (this is my first
time ever to look at the word code).


--- Comment #3 from Axel Howind <[hidden email]> ---
When reading the Word file, text pieces are read by converting `byte[]` to
String in `buildInitSB()`. I investigated the raw data passed to that method:

- so according to the Unicode table, the "greater or equal sign" has the code
0x2265, which I also see in the debugger right before the "good one" bytes.

- right before "bad one" there's a 0x0028, which in Unicode is the left
parenthesis.

So it seems that the error happens at a very low level when reading the byte
stream.
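
To illustrate the observation above: if the stored code unit is already 0x0028, decoding it can only yield a left parenthesis. The sketch below uses plain Java and assumes the text piece is stored as 16-bit little-endian Unicode; the class name `TextPieceDecodeSketch` and the byte arrays are made up for the example and are not taken from the attached file.

```java
import java.nio.charset.StandardCharsets;

final class TextPieceDecodeSketch {

    /** Decodes raw bytes as UTF-16LE (assumed storage form for this sketch). */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] good = {0x65, 0x22}; // code unit 0x2265, low byte first
        byte[] bad = {0x28, 0x00};  // code unit 0x0028, low byte first
        System.out.println(decode(good));
        System.out.println(decode(bad));
    }
}
```

The decoding step is faithful in both cases, which is consistent with the error lying in what is stored rather than in the byte-to-String conversion.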

-----

Additional findings: LibreOffice doesn't render the symbol in front of "bad
one" at all. Pages displays the correct symbol.

-----

Extracting the file on the command line yields:

axel@xiaolong tmp % unzip ../symbol_test.doc
Archive:  ../symbol_test.doc
warning [../symbol_test.doc]:  10574 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: [Content_Types].xml    
  inflating: _rels/.rels            
  inflating: theme/theme/themeManager.xml  
  inflating: theme/theme/theme1.xml  
  inflating: theme/theme/_rels/themeManager.xml.rels  

Could it be that the file is corrupt? Compare with a simple test document:

axel@xiaolong tmp % unzip ../Test.docx
Archive:  ../Test.docx
  inflating: [Content_Types].xml    
  inflating: _rels/.rels            
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml      
  inflating: word/theme/theme1.xml  
  inflating: word/settings.xml      
  inflating: docProps/core.xml      
  inflating: word/fontTable.xml      
  inflating: word/webSettings.xml    
  inflating: word/styles.xml        
  inflating: docProps/app.xml

But since Apple Pages renders it correctly and you said that you have multiple
such documents, maybe I am missing something.

Anyway, I'm out of this one.


--- Comment #4 from [hidden email] ---
(In reply to Axel Howind from comment #3)
Thanks for looking into this issue.

>
> -----
>
> Extracting the file on the command line yields:
>
> axel@xiaolong tmp % unzip ../symbol_test.doc
> Archive:  ../symbol_test.doc
> warning [../symbol_test.doc]:  10574 extra bytes at beginning or within
> zipfile
>   (attempting to process anyway)
>   inflating: [Content_Types].xml    
>   inflating: _rels/.rels            
>   inflating: theme/theme/themeManager.xml  
>   inflating: theme/theme/theme1.xml  
>   inflating: theme/theme/_rels/themeManager.xml.rels  
>

I think that is because 'symbol_test' is of type 'doc', whereas 'Test.docx' is
of type 'ooxml docx'.
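
The difference between the two container types can be checked from the first bytes of a file. This is a minimal sketch in plain Java, assuming the well-known OLE2 compound file signature for binary 'doc' files and the ZIP local-file-header signature for 'docx' files; the class name `ContainerSniffer` and the returned labels are made up for the example.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ContainerSniffer {

    // OLE2 compound file signature (binary .doc) and ZIP signature (.docx).
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file by the leading bytes passed in. */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) return "OLE2 (binary .doc)";
        if (startsWith(header, ZIP)) return "ZIP (OOXML .docx)";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical path to the file to inspect.
        byte[] header = new byte[8];
        try (InputStream in = new FileInputStream(args[0])) {
            int n = in.read(header); // a short read is treated as a short header
            System.out.println(sniff(Arrays.copyOf(header, Math.max(n, 0))));
        }
    }
}
```

A file that starts with the OLE2 signature is not a ZIP archive as a whole, which would fit the "extra bytes at beginning or within zipfile" warning that unzip printed for symbol_test.doc.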

> Could it be that the file is corrupt? Compare with a simple test document:
>
> axel@xiaolong tmp % unzip ../Test.docx
> Archive:  ../Test.docx
>   inflating: [Content_Types].xml    
>   inflating: _rels/.rels            
>   inflating: word/_rels/document.xml.rels  
>   inflating: word/document.xml      
>   inflating: word/theme/theme1.xml  
>   inflating: word/settings.xml      
>   inflating: docProps/core.xml      
>   inflating: word/fontTable.xml      
>   inflating: word/webSettings.xml    
>   inflating: word/styles.xml        
>   inflating: docProps/app.xml
>
> But since Apple pages renders it correctly and you said that you have
> multiple such documents, maybe I am missing something.
>
> Anyway, I'm out of this one.

Yes, I have many documents, and it is not only the 'greater than equal' symbol:
there are other characters that are converted into '(' as well.
I need to identify each of these characters in order to postprocess them.


--- Comment #5 from Dominik Stadler <[hidden email]> ---
Unfortunately, this seems to be caused somewhere deep in the Microsoft DOC
binary format. The text bytes that we read from the document stream in class
TextPiece already result in "( bad one", so as far as I can see there is no
conversion in Apache POI. Still, LibreOffice can display this correctly, so
there seems to be some additional information stored somewhere in the data
that Apache POI does not read or interpret yet.

This would need much more knowledge about this format than I can provide,
sorry, hopefully someone else can come up with a clue why this happens.


--- Comment #6 from [hidden email] ---
(In reply to Dominik Stadler from comment #5)
> Unfortunately this seems to be caused somewhere deep in the Microsoft DOC
> binary format, the text-bytes that we read from the document-stream in class
> TextPiece already results in "( bad one", so there is no conversion in
> Apache POI as far as I see, but still LibreOffice can display this
> correctly, so it seems there is some additional information stored somewhere
> in the data which Apache POI does not read/interpret yet.
>
> This would need much more knowledge about this format than I can provide,
> sorry, hopefully someone else can come up with a clue why this happens.

Thanks Dominik
I downloaded LibreOffice and saved the document as HTML. You're right that
LibreOffice outputs this correctly. Is there any way to mimic this behaviour in
Apache POI?


--- Comment #7 from Dominik Stadler <[hidden email]> ---
Unfortunately it seems this information is stored in a way that Apache POI does
not support right now, so it would need someone to find the time and expertise
to dig into the format and the code of Apache POI, no way to "mimic" as far as
I see.


--- Comment #8 from [hidden email] ---
(In reply to Dominik Stadler from comment #7)
> Unfortunately it seems this information is stored in a way that Apache POI
> does not support right now, so it would need someone to find the time and
> expertise to dig into the format and the code of Apache POI, no way to
> "mimic" as far as I see.

Thanks Dominik for looking into this issue.
I would love to get involved in working on this issue. I have never done that
before, but I have used the Apache POI API for a while. I have time but no
expertise. Is there any way to get some help from experts to get started, or
are there any instructions that someone like me could follow?


--- Comment #9 from Dominik Stadler <[hidden email]> ---
Unfortunately, getting familiar with this topic may take a while, if you can
find the time.

You may first need to review the official technical documentation from
Microsoft at
https://msdn.microsoft.com/en-us/library/cc313105%28v=office.12%29.aspx and
compare this with the actual code in Apache POI, e.g. the starting point would
be the constructor of class HWPFDocument and the classes used there to read the
binary format.

Otherwise the dev-mailing list will be a good place for asking questions while
you go along.
