Performance Question with CTSheetDataImpl.java

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Performance Question with CTSheetDataImpl.java

Bryce.Alcock

I am looking at some performance issues with POI saving XLSX
to disk for large spreadsheets (65,000 + rows).

I have tracked the issue down to:
sheetData.setRowArray(rArray.toArray(new CTRow[rArray.size()]));


which invokes some Code:
CTSheetDataImpl.java

invokes...
XmlComplexContentImpl.java

invokes...
Xobj.java



My Question:
I cannot find the CTSheetDataImpl.java code.
Where is that code, is it opensource?

I am tasked with making this faster.
Please let me know where this code is:

Full Package:
org/openxmlformats/schemas/spreadsheetml/x2006/main/impl/CTSheetDataImpl.class

BTW
www.openxmlformats.org takes you to the following website:
http://www.microsoft.com/en/us/default.aspx



Thanks


Bryce Alcock * Performance Architect * SunGard *
http://www.brycealcock.com




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

Nick Burch-11
On Wed, 5 May 2010, [hidden email] wrote:
> My Question:
> I cannot find the CTSheetDataImpl.java code.
> Where is that code, is it opensource?

It's one of the files that are auto-generated by xmlbeans from the ecma /
microsoft XSDs. If you download the source to poi and execute the ant task
"compile-ooxml-xsds", it will get xmlbeans to generate a source jar file
for you.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher
In reply to this post by Bryce.Alcock
Hi Bryce,

Openxml classes are built from the XSDs. This is something that is seldom done.

Download the full source and look into the maven directory in the ooxml-schemas.pom and you will see this comment:

  <description>XmlBeans generated from the Ecma supplied xsds:
    http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%20Part%204%20(DOCX).zip</description>

Apache xmlbeans 2.3.0 is used.

Good luck, someone else will need to help with any deeper detail.

Regards,
Dave

On May 5, 2010, at 2:17 PM, <[hidden email]> <[hidden email]> wrote:

>
> I am looking at some performance issues with POI saving XLSX
> to disk for large spreadsheets (65,000 + rows).
>
> I have tracked the issue down to:
> sheetData.setRowArray(rArray.toArray(new CTRow[rArray.size()]));
>
>
> which invokes some Code:
> CTSheetDataImpl.java
>
> invokes...
> XmlComplexContentImpl.java
>
> invokes...
> Xobj.java
>
>
>
> My Question:
> I cannot find the CTSheetDataImpl.java code.
> Where is that code, is it opensource?
>
> I am tasked with making this faster.
> Please let me know where this code is:
>
> Full Package:
> org/openxmlformats/schemas/spreadsheetml/x2006/main/impl/CTSheetDataImpl.class
>
> BTW
> www.openxmlformats.org takes you to the following website:
> http://www.microsoft.com/en/us/default.aspx
>
>
>
> Thanks
>
>
> Bryce Alcock * Performance Architect * SunGard *
> http://www.brycealcock.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

Mark Beardsley
In reply to this post by Bryce.Alcock
If performance is an issue with such a parge worksheet, have you ooked into Yegor's BigGridDemo code yet? I do not have the link to hand to post and accept that it is a proof so to speak and not a complete application, but it does demonstrate an alternative technique if you are working with large amounts of data.

Yours

Mark B

Bryce.Alcock wrote
I am looking at some performance issues with POI saving XLSX
to disk for large spreadsheets (65,000 + rows).

I have tracked the issue down to:
sheetData.setRowArray(rArray.toArray(new CTRow[rArray.size()]));


which invokes some Code:
CTSheetDataImpl.java

invokes...
XmlComplexContentImpl.java

invokes...
Xobj.java



My Question:
I cannot find the CTSheetDataImpl.java code.
Where is that code, is it opensource?

I am tasked with making this faster.
Please let me know where this code is:

Full Package:
org/openxmlformats/schemas/spreadsheetml/x2006/main/impl/CTSheetDataImpl.class

BTW
www.openxmlformats.org takes you to the following website:
http://www.microsoft.com/en/us/default.aspx



Thanks


Bryce Alcock * Performance Architect * SunGard *
http://www.brycealcock.com




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock

I would definitely like to see this if anyone has a link
to this and can provide a usable demo.


Thanks
Bryce

---------------------------------------------------------------------------
Bryce Alcock * Performance Architect * SunGard *
...........................................................................
personal website http://www.brycealcock.com/
---------------------------------------------------------------------------





-----Original Message-----
From: MSB [mailto:[hidden email]]
Sent: Thu 5/6/2010 1:40 AM
To: [hidden email]
Subject: Re: Performance Question with CTSheetDataImpl.java
 

If performance is an issue with such a parge worksheet, have you ooked into
Yegor's BigGridDemo code yet? I do not have the link to hand to post and
accept that it is a proof so to speak and not a complete application, but it
does demonstrate an alternative technique if you are working with large
amounts of data.

Yours

Mark B


Bryce.Alcock wrote:

>
>
> I am looking at some performance issues with POI saving XLSX
> to disk for large spreadsheets (65,000 + rows).
>
> I have tracked the issue down to:
> sheetData.setRowArray(rArray.toArray(new CTRow[rArray.size()]));
>
>
> which invokes some Code:
> CTSheetDataImpl.java
>
> invokes...
> XmlComplexContentImpl.java
>
> invokes...
> Xobj.java
>
>
>
> My Question:
> I cannot find the CTSheetDataImpl.java code.
> Where is that code, is it opensource?
>
> I am tasked with making this faster.
> Please let me know where this code is:
>
> Full Package:
> org/openxmlformats/schemas/spreadsheetml/x2006/main/impl/CTSheetDataImpl.class
>
> BTW
> www.openxmlformats.org takes you to the following website:
> http://www.microsoft.com/en/us/default.aspx
>
>
>
> Thanks
>
>
> Bryce Alcock * Performance Architect * SunGard *
> http://www.brycealcock.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
--
View this message in context: http://old.nabble.com/Performance-Question-with-CTSheetDataImpl.java-tp28466607p28469819.html
Sent from the POI - Dev mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]









---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher
You should be able to find it in the poi source in the examples:

org.apache.poi.xssf.usermodel.examples

./examples/src/org/apache/poi/xssf/usermodel/examples/BigGridDemo.java

Regards,
Dave

On May 6, 2010, at 6:41 AM, <[hidden email]> <[hidden email]> wrote:

>
> I would definitely like to see this if anyone has a link
> to this and can provide a usable demo.
>
>
> Thanks
> Bryce
>
> ---------------------------------------------------------------------------
> Bryce Alcock * Performance Architect * SunGard *
> ...........................................................................
> personal website http://www.brycealcock.com/
> ---------------------------------------------------------------------------
>
>
>
>
>
> -----Original Message-----
> From: MSB [mailto:[hidden email]]
> Sent: Thu 5/6/2010 1:40 AM
> To: [hidden email]
> Subject: Re: Performance Question with CTSheetDataImpl.java
>
>
> If performance is an issue with such a parge worksheet, have you ooked into
> Yegor's BigGridDemo code yet? I do not have the link to hand to post and
> accept that it is a proof so to speak and not a complete application, but it
> does demonstrate an alternative technique if you are working with large
> amounts of data.
>
> Yours
>
> Mark B
>
>
> Bryce.Alcock wrote:
>>
>>
>> I am looking at some performance issues with POI saving XLSX
>> to disk for large spreadsheets (65,000 + rows).
>>
>> I have tracked the issue down to:
>> sheetData.setRowArray(rArray.toArray(new CTRow[rArray.size()]));
>>
>>
>> which invokes some Code:
>> CTSheetDataImpl.java
>>
>> invokes...
>> XmlComplexContentImpl.java
>>
>> invokes...
>> Xobj.java
>>
>>
>>
>> My Question:
>> I cannot find the CTSheetDataImpl.java code.
>> Where is that code, is it opensource?
>>
>> I am tasked with making this faster.
>> Please let me know where this code is:
>>
>> Full Package:
>> org/openxmlformats/schemas/spreadsheetml/x2006/main/impl/CTSheetDataImpl.class
>>
>> BTW
>> www.openxmlformats.org takes you to the following website:
>> http://www.microsoft.com/en/us/default.aspx
>>
>>
>>
>> Thanks
>>
>>
>> Bryce Alcock * Performance Architect * SunGard *
>> http://www.brycealcock.com
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/Performance-Question-with-CTSheetDataImpl.java-tp28466607p28469819.html
> Sent from the POI - Dev mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock
In reply to this post by Nick Burch-11

Thanks Nick,
Getting Closer:


Quick question on this,  I need to recompile with Line numbers on (for the debugger).
Do you know how I can do that.  (you may want to make that a permanent change...)


here is the ant task for ref:  (I have added some echo message directives...)

    <target name="compile-ooxml-xsds"
            depends="check-jars,fetch-jars,check-compiled-ooxml-xsds"
            description="Unpacks the OOXML xsd files, and compiles them into XmlBeans">
        <property name="ooxml.xsds.tmp.dir" location="build/ooxml-xsds"/>
         <echo message="setting property : ooxml.xsds.tmp.dir - ${ooxml.xsds.tmp.dir}"/>
        <mkdir dir="${ooxml.xsds.tmp.dir}"/>

        <taskdef name="xmlbean"
                 classname="org.apache.xmlbeans.impl.tool.XMLBean"
                 classpath="${ooxml.xmlbeans.jar}:${ooxml.jsr173.jar}"/>

         <echo message="unzip ${ooxml.xsds.izip} into - ${ooxml.xsds.tmp.dir}"/>
        <unzip src="${ooxml.xsds.izip}" dest="${ooxml.xsds.tmp.dir}"/>
        <!--
              schema="build/ooxml-xsds/"
              schema="build/ooxml-xsds/sml-workbook.xsd"
          -->
        <echo message="xmlbean task :  schema->${ooxml.xsds.tmp.dir} : srcgendir - ${ooxml.xsds.src.dir} : destfile - ${ooxml.xsds.jar} "/>
        <xmlbean
                schema="${ooxml.xsds.tmp.dir}"
                srcgendir="${ooxml.xsds.src.dir}"
                optimize="yes"
                destfile="${ooxml.xsds.jar}"
                javasource="1.5"
                failonerror="true"
                fork="true"
                memoryMaximumSize="512m"
                >
            <classpath refid="ooxml.classpath"/>
        </xmlbean>

        <echo message="jar the schema sources: basedir - ${ooxml.xsds.src.dir} : destfile - ${ooxml.xsds.src.jar}"/>
        <!-- Now make a jar of the schema sources -->
        <jar
                basedir="${ooxml.xsds.src.dir}"
                destfile="${ooxml.xsds.src.jar}"
                />
        <echo message="leaving ooxml-compile-xsds"/>
    </target>



Bryce Alcock * Performance Architect * SunGard *
Asset Arena * 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
Tel 630-986-3006 * Confidential Fax 630-515-1908 *
www.sungard.com/assetarena



-----Original Message-----
From: Nick Burch [mailto:[hidden email]]
Sent: Wed 5/5/2010 5:06 PM
To: POI Developers List
Subject: Re: Performance Question with CTSheetDataImpl.java
 
On Wed, 5 May 2010, [hidden email] wrote:
> My Question:
> I cannot find the CTSheetDataImpl.java code.
> Where is that code, is it opensource?

It's one of the files that are auto-generated by xmlbeans from the ecma /
microsoft XSDs. If you download the source to poi and execute the ant task
"compile-ooxml-xsds", it will get xmlbeans to generate a source jar file
for you.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]







---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Nick Burch-11
On Thu, 6 May 2010, [hidden email] wrote:
> Quick question on this, I need to recompile with Line numbers on (for
> the debugger). Do you know how I can do that.

You'll need to check the xmlbeans documentation. I've no idea what the
flag you'll need to pass to the xmlbeans ant task is off the top of my
head.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock

Just in keeping anyone interested in-tune to what I am finding:
(Slowly plodding forward)

The Code from xmlbeans  (XmlComplexContentImpl.java)
has a tight loop, that scales as O(n) for a single match,
and the way XMLBeans is being used within POI causes a scaling of
O(n^2) because each row has hits this method.

this is significant.
If This could be re-written to be O(1) or at least O(log n)  then
POI would scale linearly (or log linearly) with spreadsheet size.

We may need to engage the XMLBean folks on this matter.
OR SOMEONE with a really strong algorithms background...

Please let me know if we can figure out how to proceed with this:
(I WOULD LIKE TO GET THE BEST PEOPLE ENGAGED as I think this could
have a significant impact on POI's usability)

 
CODE:   STARTING at Line 1137 of org.apache.xmlbeans.impl.values.XmlComplexContentImpl.java
-------------------------------------------------------------------------------------------
                // The new object matches what already exists in the array
                // Heuristic: we optimize for the case where the new elements
                // in the array are the same as the existing elements with
                // potentially new elements inserted

                // First insert the new element in the array at position 0
                int j = 0;
                for ( j = 0; j < i; j++ )
                {
                    TypeStoreUser user = store.insert_element_user( elemName, j );
                    ((XmlObjectBase) user).set( sources[j] );
                }
                for ( i++, j++; i < sources.length; i++, j++)
                {
                    XmlCursor c = sources[i].isImmutable() ? null : sources[i].newCursor();
                    if (c != null && c.toParent() && c.getObject() == this)
                    {
                        c.dispose();
                        current = store.find_element_user( elemName, j );
                        if (current == sources[i])
                            ; // advance
                        else
                        {
                            // Fall back to the general case
                            break;
                        }
                    }
                    else
                    {
                        c.dispose();
                        // Insert before the current element
                        TypeStoreUser user = store.insert_element_user( elemName, j );
                        ((XmlObjectBase) user).set( sources[i] );
                    }
                }
                startDest = j;
                startSrc = i;
                m = store.count_elements( elemName );
            }
            // Fall through
        }
        else
        {
            // All of the elements in the existing array are to
            // be deleted and replaced with elements from the
            // sources array
        }



Thanks
Bryce

---------------------------------------------------------------------------
Bryce Alcock * Performance Architect * SunGard *
...........................................................................
http://www.brycealcock.com/
---------------------------------------------------------------------------





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher
Bryce,

I have a suggestion.

(1) See what happens if you use XMLBeans 2.4.0.

(a) They may have fixed this strange algorithm.

(b) Or, POI may be incompatible with XMLBeans 2.4.0.

If that fails we can discuss the next steps - which would probably to start asking questions about this poor algorithm choice in XMLBeans.

If it succeeds then we have a good reason to move the trunk of POI up to XMLBeans 2.4 and to produce a new version of OOXML.

Without seeing the whole loop, I can't agree that it is definitely O(N^2), but it looks like it. -It looks like they implemented simple logic for small test cases without concern for large (or even huge) node lists.

Here is XMLBean's JIRA - https://issues.apache.org/jira/secure/BrowseProject.jspa?id=10436

This might be helpful - http://wiki.apache.org/xmlbeans/V2Features

Regards,
Dave

On May 6, 2010, at 10:17 AM, <[hidden email]> <[hidden email]> wrote:

>
> Just in keeping anyone interested in-tune to what I am finding:
> (Slowly plodding forward)
>
> The Code from xmlbeans  (XmlComplexContentImpl.java)
> has a tight loop, that scales as O(n) for a single match,
> and the way XMLBeans is being used within POI causes a scaling of
> O(n^2) because each row has hits this method.
>
> this is significant.
> If This could be re-written to be O(1) or at least O(log n)  then
> POI would scale linearly (or log linearly) with spreadsheet size.
>
> We may need to engage the XMLBean folks on this matter.
> OR SOMEONE with a really strong algorithms background...
>
> Please let me know if we can figure out how to proceed with this:
> (I WOULD LIKE TO GET THE BEST PEOPLE ENGAGED as I think this could
> have a significant impact on POI's usability)
>
>
> CODE:   STARTING at Line 1137 of org.apache.xmlbeans.impl.values.XmlComplexContentImpl.java
> -------------------------------------------------------------------------------------------
>                // The new object matches what already exists in the array
>                // Heuristic: we optimize for the case where the new elements
>                // in the array are the same as the existing elements with
>                // potentially new elements inserted
>
>                // First insert the new element in the array at position 0
>                int j = 0;
>                for ( j = 0; j < i; j++ )
>                {
>                    TypeStoreUser user = store.insert_element_user( elemName, j );
>                    ((XmlObjectBase) user).set( sources[j] );
>                }
>                for ( i++, j++; i < sources.length; i++, j++)
>                {
>                    XmlCursor c = sources[i].isImmutable() ? null : sources[i].newCursor();
>                    if (c != null && c.toParent() && c.getObject() == this)
>                    {
>                        c.dispose();
>                        current = store.find_element_user( elemName, j );
>                        if (current == sources[i])
>                            ; // advance
>                        else
>                        {
>                            // Fall back to the general case
>                            break;
>                        }
>                    }
>                    else
>                    {
>                        c.dispose();
>                        // Insert before the current element
>                        TypeStoreUser user = store.insert_element_user( elemName, j );
>                        ((XmlObjectBase) user).set( sources[i] );
>                    }
>                }
>                startDest = j;
>                startSrc = i;
>                m = store.count_elements( elemName );
>            }
>            // Fall through
>        }
>        else
>        {
>            // All of the elements in the existing array are to
>            // be deleted and replaced with elements from the
>            // sources array
>        }
>
>
>
> Thanks
> Bryce
>
> ---------------------------------------------------------------------------
> Bryce Alcock * Performance Architect * SunGard *
> ...........................................................................
> http://www.brycealcock.com/
> ---------------------------------------------------------------------------
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock
Dave  and others... ,

Took your suggestion and pulled in XMLBeans 2.4
Worked but contained the O(n^2) issue.

I have spent some time further investigating this, and it is
fairly simple to analyse and understand why it is O(n^2).

The real question is what can we do.
NOTE:  My line numbers won't match with SVN because I have modified the code to put
some timers and system outs in place.


Just a Quick overview
it seems like XmlComplexContentImpl.arraySetterHelper iterates over all elements in the spread sheet.
then in side of that iteration, the Xobj.find_element_user(Qname,int) is an Order n operation.

Just making Xobj.find_element_user a log(n) would completely solve the problem for me.



ANALYSIS:   O(n^2) in XML Beans:   (Around Line 1153 or so....)
-----------------------------------------------------------
//   Order N  (this is invoked 1 time for each element in sources (~120,000 records in my case)


for ( i++, j++; i < sources.length; i++, j++)
{
long t1 = System.currentTimeMillis();
    XmlCursor c = sources[i].isImmutable() ? null : sources[i].newCursor();
getCursorTime += (System.currentTimeMillis()-t1);
long t2=System.currentTimeMillis();
    if (c != null && c.toParent() && c.getObject() == this)
    {
ifCheckTime += (System.currentTimeMillis()-t2);
long t3 = System.currentTimeMillis();
        c.dispose();
long t4 = System.currentTimeMillis();

//  THIS CAUSES O(n^2) SEE CODE BELOW, but this on O(n) for each record.
//  THIS IS THE INEFFICIENCY in XMLBeans...

        current = store.find_element_user( elemName, j );


findUserElement += (System.currentTimeMillis() - t4);
        if (current == sources[i])
              ; // advance
        else
        {
             // Fall back to the general case
             break;
        }
insideIfTime += (System.currentTimeMillis()-t3);
   }
   else
   {
ifCheckTime += (System.currentTimeMillis()-t2);
       c.dispose();
       // Insert before the current element
       TypeStoreUser user = store.insert_element_user( elemName, j );
       ((XmlObjectBase) user).set( sources[i] );
    }
}





---------------- Method from store.find_element_user(QName,int i)---------

//  THIS METHOD IS 0(n)
//  but this is called n times by XmlComplexContentImpl.java around line 1153....
public TypeStoreUser find_element_user ( QName name, int i )
 {
     for ( Xobj x = _firstChild ; x != null ; x = x._nextSibling )
        if (x.isElem() && x._name.equals( name ) && --i < 0)
            return x.getUser();
        return null;
 }



Bryce,

I have a suggestion.

(1) See what happens if you use XMLBeans 2.4.0.

(a) They may have fixed this strange algorithm.

(b) Or, POI may be incompatible with XMLBeans 2.4.0.

If that fails we can discuss the next steps - which would probably to start asking questions about this poor algorithm choice in XMLBeans.

If it succeeds then we have a good reason to move the trunk of POI up to XMLBeans 2.4 and to produce a new version of OOXML.

Without seeing the whole loop, I can't agree that it is definitely O(N^2), but it looks like it. -It looks like they implemented simple logic for small test cases without concern for large (or even huge) node lists.

Here is XMLBean's JIRA - https://issues.apache.org/jira/secure/BrowseProject.jspa?id=10436

This might be helpful - http://wiki.apache.org/xmlbeans/V2Features

Regards,
Dave



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher
Bryce,

> public TypeStoreUser find_element_user ( QName name, int i )
> {
>     for ( Xobj x = _firstChild ; x != null ; x = x._nextSibling )
>        if (x.isElem() && x._name.equals( name ) && --i < 0)
>            return x.getUser();
>        return null;
> }

Why that is simply very bad programing. Why are named items in a linked list? Because it is easy to program for small sets.

I think you have a perfect example of something that ought to be patched in XMLBeans. There are two paths

(1) Figure out how to patch xmlbeans to use a HashMap for name references yet retain the linked list for preserving the order. When done contribute it to Apache xmlbeans.

(2) Ask the xmlbeans user list about this trouble.

Regards,
Dave

On May 7, 2010, at 10:48 AM, <[hidden email]> <[hidden email]> wrote:

> Dave  and others... ,
>
> Took your suggestion and pulled in XMLBeans 2.4
> Worked but contained the O(n^2) issue.
>
> I have spent some time further investigating this, and it is
> fairly simple to analyse and understand why it is O(n^2).
>
> The real question is what can we do.
> NOTE:  My line numbers won't match with SVN because I have modified the code to put
> some timers and system outs in place.
>
>
> Just a Quick overview
> it seems like XmlComplexContentImpl.arraySetterHelper iterates over all elements in the spread sheet.
> then in side of that iteration, the Xobj.find_element_user(Qname,int) is an Order n operation.
>
> Just making Xobj.find_element_user a log(n) would completely solve the problem for me.
>
>
>
> ANALYSIS:   O(n^2) in XML Beans:   (Around Line 1153 or so....)
> -----------------------------------------------------------
> //   Order N  (this is invoked 1 time for each element in sources (~120,000 records in my case)
>
>
> for ( i++, j++; i < sources.length; i++, j++)
> {
> long t1 = System.currentTimeMillis();
>    XmlCursor c = sources[i].isImmutable() ? null : sources[i].newCursor();
> getCursorTime += (System.currentTimeMillis()-t1);
> long t2=System.currentTimeMillis();
>    if (c != null && c.toParent() && c.getObject() == this)
>    {
> ifCheckTime += (System.currentTimeMillis()-t2);
> long t3 = System.currentTimeMillis();
>        c.dispose();
> long t4 = System.currentTimeMillis();
>
> //  THIS CAUSES O(n^2) SEE CODE BELOW, but this on O(n) for each record.
> //  THIS IS THE INEFFICIENCY in XMLBeans...
>
>        current = store.find_element_user( elemName, j );
>
>
> findUserElement += (System.currentTimeMillis() - t4);
>        if (current == sources[i])
>              ; // advance
>        else
>        {
>             // Fall back to the general case
>             break;
>        }
> insideIfTime += (System.currentTimeMillis()-t3);
>   }
>   else
>   {
> ifCheckTime += (System.currentTimeMillis()-t2);
>       c.dispose();
>       // Insert before the current element
>       TypeStoreUser user = store.insert_element_user( elemName, j );
>       ((XmlObjectBase) user).set( sources[i] );
>    }
> }
>
>
>
>
>
> ---------------- Method from store.find_element_user(QName,int i)---------
>
> //  THIS METHOD IS 0(n)
> //  but this is called n times by XmlComplexContentImpl.java around line 1153....
> public TypeStoreUser find_element_user ( QName name, int i )
> {
>     for ( Xobj x = _firstChild ; x != null ; x = x._nextSibling )
>        if (x.isElem() && x._name.equals( name ) && --i < 0)
>            return x.getUser();
>        return null;
> }
>
>
>
> Bryce,
>
> I have a suggestion.
>
> (1) See what happens if you use XMLBeans 2.4.0.
>
> (a) They may have fixed this strange algorithm.
>
> (b) Or, POI may be incompatible with XMLBeans 2.4.0.
>
> If that fails we can discuss the next steps - which would probably to start asking questions about this poor algorithm choice in XMLBeans.
>
> If it succeeds then we have a good reason to move the trunk of POI up to XMLBeans 2.4 and to produce a new version of OOXML.
>
> Without seeing the whole loop, I can't agree that it is definitely O(N^2), but it looks like it. -It looks like they implemented simple logic for small test cases without concern for large (or even huge) node lists.
>
> Here is XMLBean's JIRA - https://issues.apache.org/jira/secure/BrowseProject.jspa?id=10436
>
> This might be helpful - http://wiki.apache.org/xmlbeans/V2Features
>
> Regards,
> Dave
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock
In reply to this post by Bryce.Alcock

Continuing to chip away at CTSheetDataImpl's performance issues.
However, the problematic code is unlikely to be changed, I have a discussion going on the XMLBeans website, and as it turns out that code is most likely not going to change, or will not change easily.


With that in mind, I started taking a close look at the POI side of things and very specifically the problematic code from a POI perspective.


At approximately line 2368 of XSSFSheet.java

the following line is the suspect line:
I have modified the code to break 1 line into 3  the slow line is the last one.

        CTRow[] lctRow = new CTRow[rArray.size()];
        lctRow = rArray.toArray(lctRow);
        sheetData.setRowArray(lctRow);



My Question is:  in looking at XSSFSheet.java
are there any assumption that we can make about the state of sheetData (A CTSheetData object)
that would allow us to either overwrite the default setRowArray(lctRow) method,
or side step this call all together by calling a method that is not generic, but instead much
more targeted to the function and process that is currently going on.








Here is the full call in context...Take from my modified version of XSSFSheet.java


    protected void write(OutputStream out) throws IOException {

        if(worksheet.getColsArray().length == 1) {
            CTCols col = worksheet.getColsArray(0);
            CTCol[] cols = col.getColArray();
            if(cols.length == 0) {
                worksheet.setColsArray(null);
            }
        }

        // Now re-generate our CTHyperlinks, if needed
        if(hyperlinks.size() > 0) {
            if(worksheet.getHyperlinks() == null) {
                worksheet.addNewHyperlinks();
            }
            CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
            for(int i=0; i<ctHls.length; i++) {
                // If our sheet has hyperlinks, have them add
                //  any relationships that they might need
                XSSFHyperlink hyperlink = hyperlinks.get(i);
                hyperlink.generateRelationIfNeeded(getPackagePart());
                // Now grab their underling object
                ctHls[i] = hyperlink.getCTHyperlink();
            }
            worksheet.getHyperlinks().setHyperlinkArray(ctHls);
        }

        CTSheetData sheetData = worksheet.getSheetData();
        ArrayList<CTRow> rArray = new ArrayList<CTRow>(rows.size());
        for(XSSFRow row : rows.values()){
            row.onDocumentWrite();
            rArray.add(row.getCTRow());
        }
                long startTimeSheetData = System.currentTimeMillis();
                CTRow[] lctRow = new CTRow[rArray.size()];
                lctRow = rArray.toArray(lctRow);
        sheetData.setRowArray(lctRow);
                System.out.println("Sheet Time : " + (System.currentTimeMillis() - startTimeSheetData));


        XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
        xmlOptions.setSaveSyntheticDocumentElement(new QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
        Map<String, String> map = new HashMap<String, String>();
        map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
        xmlOptions.setSaveSuggestedPrefixes(map);
                long startTimeWorkSheetSave = System.currentTimeMillis();

        worksheet.save(out, xmlOptions);
                System.out.println("Worksheet save Time : " + (System.currentTimeMillis() - startTimeWorkSheetSave));

    }



Best Regards
Bryce Alcock


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher
Hi Bryce,

I am intrigued by your work here. It is very interesting. I think that there are probably some interesting possibilities with creating a more efficient structure to hold Rows of data. Probably down to the level of the Cells within each row.

Somehow the default behavior of OOXML would need to be overridden. I'm not sure about how to do it.

> Continuing to chip away at CTSheetDataImpl's performance issues.
> However, the problematic code is unlikely to be changed, I have a discussion going on the XMLBeans website, and as it turns out that code is most likely not going to change, or will not change easily.

It might be illuminating if you could either highlight some of the issues discussed on the XMLBeans list, or provide a link to the thread.

If we are going to hack into or over XMLBeans we have to make a strong case.

> With that in mind, I started taking a close look at the POI side of things and very specifically the problematic code from a POI perspective.

I think that what we are discussing is implementing random access to the in memory cell objects. Done in a way where we keep the linear/serial structure this silly algorithm is preserving by using a Tree rather than an array and walking it when output is needed.

This would significantly improve the performance of ooxml.

Regards,
Dave

>
>
> At approximately line 2368 of XSSFSheet.java
>
> the following line is the suspect line:
> I have modified the code to break 1 line into 3  the slow line is the last one.
>
>        CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>        sheetData.setRowArray(lctRow);
>
>
>
> My Question is:  in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData (A CTSheetData object)
> that would allow us to either overwrite the default setRowArray(lctRow) method,
> or side step this call all together by calling a method that is not generic, but instead much
> more targeted to the function and process that is currently going on.
>
>
>
>
>
>
>
>
> Here is the full call in context...Take from my modified version of XSSFSheet.java
>
>
>    protected void write(OutputStream out) throws IOException {
>
>        if(worksheet.getColsArray().length == 1) {
>            CTCols col = worksheet.getColsArray(0);
>            CTCol[] cols = col.getColArray();
>            if(cols.length == 0) {
>                worksheet.setColsArray(null);
>            }
>        }
>
>        // Now re-generate our CTHyperlinks, if needed
>        if(hyperlinks.size() > 0) {
>            if(worksheet.getHyperlinks() == null) {
>                worksheet.addNewHyperlinks();
>            }
>            CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>            for(int i=0; i<ctHls.length; i++) {
>                // If our sheet has hyperlinks, have them add
>                //  any relationships that they might need
>                XSSFHyperlink hyperlink = hyperlinks.get(i);
>                hyperlink.generateRelationIfNeeded(getPackagePart());
>                // Now grab their underling object
>                ctHls[i] = hyperlink.getCTHyperlink();
>            }
>            worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>        }
>
>        CTSheetData sheetData = worksheet.getSheetData();
>        ArrayList<CTRow> rArray = new ArrayList<CTRow>(rows.size());
>        for(XSSFRow row : rows.values()){
>            row.onDocumentWrite();
>            rArray.add(row.getCTRow());
>        }
> long startTimeSheetData = System.currentTimeMillis();
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>        sheetData.setRowArray(lctRow);
> System.out.println("Sheet Time : " + (System.currentTimeMillis() - startTimeSheetData));
>
>
>        XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>        xmlOptions.setSaveSyntheticDocumentElement(new QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>        Map<String, String> map = new HashMap<String, String>();
>        map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
>        xmlOptions.setSaveSuggestedPrefixes(map);
> long startTimeWorkSheetSave = System.currentTimeMillis();
>
>        worksheet.save(out, xmlOptions);
> System.out.println("Worksheet save Time : " + (System.currentTimeMillis() - startTimeWorkSheetSave));
>
>    }
>
>
>
> Best Regards
> Bryce Alcock
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Nick Burch-11
In reply to this post by Bryce.Alcock
On Tue, 18 May 2010, [hidden email] wrote:
> My Question is:  in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData
> (A CTSheetData object) that would allow us to either overwrite the
> default setRowArray(lctRow) method, or side step this call all together
> by calling a method that is not generic, but instead much more targeted
> to the function and process that is currently going on.

If memory serves, we basically need to overwrite the old list of row
details with the new ones

We have the list of CTRow objects easily available (via the XSSFRow
objects that wrap them in a friendly interface). As long as we get rid of
the old list, it should be fine. Would we be better of doing something
like setRowArray(new CTRow[0]) then using one of the insert methods, or
would that be equally as bad?

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

Yegor Kozlov
In reply to this post by Bryce.Alcock
the purpose of calling CTSheetData#setRowArray is to ensure that the
array of CTRow beans in CTSheetData is ordered. The point is that we
don't always need this operation, in many cases rows are already ordered
and re-assigning them to CTSheetData is an unnecessary expensive operation.

Compare two use cases:

case 1: rows are written in increasing order

XSSFSheet sheet = workbook.createSheet();
sheet.createRow(1);
sheet.createRow(2);
sheet.createRow(3);
workbook.write(out); //no need to re-order rows before writing


case 2: row are created in random order

XSSFSheet sheet = workbook.createSheet();
sheet.createRow(3);
sheet.createRow(2);
sheet.createRow(1);
workbook.write(out); //need to re-order rows before writing

In the first case the call of CTSheetData#setRowArray is extra because
the state of CTSheetData match the logical model. In the second case we
do need to call CTSheetData#setRowArray to ensure that rows are written
in ascending order.

So, the performance of XSSFSheet#write depends on how rows were added:
if rows are added in strict ascending order then XSSFSheet#write is
fast. If the order of rows is random or rows were shifted then
XSSFSheet#write involves CTSheetData#setRowArray and becomes more
expensive.

I committed this improvement in r947542. Please try the latest build
from trunk.

Below are two my tests, the only difference is that in the first test
rows are written in ascending order and in the second test in descending.

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = 0; i < Short.MAX_VALUE; i++) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out);
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }


generate: 5488 ms
write: 2507 ms

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = Short.MAX_VALUE; i > 0; i--) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out); //will involve CTSheetData#setRowArray
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }

generate: 9291 ms
write: 94888 ms

The first test is almost 40X faster!

Regards,
Yegor



> Continuing to chip away at CTSheetDataImpl's performance issues.
> However, the problematic code is unlikely to be changed, I have a discussion going on the XMLBeans website, and as it turns out that code is most likely not going to change, or will not change easily.
>
>
> With that in mind, I started taking a close look at the POI side of things and very specifically the problematic code from a POI perspective.
>
>
> At approximately line 2368 of XSSFSheet.java
>
> the following line is the suspect line:
> I have modified the code to break 1 line into 3  the slow line is the last one.
>
>          CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
>
>
>
> My Question is:  in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData (A CTSheetData object)
> that would allow us to either overwrite the default setRowArray(lctRow) method,
> or side step this call all together by calling a method that is not generic, but instead much
> more targeted to the function and process that is currently going on.
>
>
>
>
>
>
>
>
> Here is the full call in context...Take from my modified version of XSSFSheet.java
>
>
>      protected void write(OutputStream out) throws IOException {
>
>          if(worksheet.getColsArray().length == 1) {
>              CTCols col = worksheet.getColsArray(0);
>              CTCol[] cols = col.getColArray();
>              if(cols.length == 0) {
>                  worksheet.setColsArray(null);
>              }
>          }
>
>          // Now re-generate our CTHyperlinks, if needed
>          if(hyperlinks.size()>  0) {
>              if(worksheet.getHyperlinks() == null) {
>                  worksheet.addNewHyperlinks();
>              }
>              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>              for(int i=0; i<ctHls.length; i++) {
>                  // If our sheet has hyperlinks, have them add
>                  //  any relationships that they might need
>                  XSSFHyperlink hyperlink = hyperlinks.get(i);
>                  hyperlink.generateRelationIfNeeded(getPackagePart());
>                  // Now grab their underling object
>                  ctHls[i] = hyperlink.getCTHyperlink();
>              }
>              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>          }
>
>          CTSheetData sheetData = worksheet.getSheetData();
>          ArrayList<CTRow>  rArray = new ArrayList<CTRow>(rows.size());
>          for(XSSFRow row : rows.values()){
>              row.onDocumentWrite();
>              rArray.add(row.getCTRow());
>          }
> long startTimeSheetData = System.currentTimeMillis();
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
> System.out.println("Sheet Time : " + (System.currentTimeMillis() - startTimeSheetData));
>
>
>          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>          xmlOptions.setSaveSyntheticDocumentElement(new QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>          Map<String, String>  map = new HashMap<String, String>();
>          map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
>          xmlOptions.setSaveSuggestedPrefixes(map);
> long startTimeWorkSheetSave = System.currentTimeMillis();
>
>          worksheet.save(out, xmlOptions);
> System.out.println("Worksheet save Time : " + (System.currentTimeMillis() - startTimeWorkSheetSave));
>
>      }
>
>
>
> Best Regards
> Bryce Alcock
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock
The performance improvement is significant for my simple "Laboratory" test.
See attached PNG file showing Yegor's code vs. Poi before the change.


Yegor's Code Changes Inplace:
Note that the AVG Time per Row does not change (O(1) per row leads to O(n) over all)

File Name       Rows Total AVG
                                Time Mills/
                                Millis Row
./20000_true.xls   20000 1964 0.1
./24000_true.xls   24000 2019 0.08
./28800_true.xls   28800 2624 0.09
./34560_true.xls   34560 3012 0.09
./41472_true.xls   41472 3409 0.08
./49766_true.xls   49766 4617 0.09
./59719_true.xls   59719 6162 0.1
./71662_true.xls   71662 5705 0.08
./85994_true.xls   85994 7487 0.09
./103192_true.xls 103192 9060 0.09
./123830_true.xls 123830 11875 0.1
./148596_true.xls 148596 13591 0.09


Original Poi (pre row order check)
Notice that the AVG Mills/Row increases linearly with rows
(O(n) which leads to O(n^2) as seen in the Total Time)

File Name       Rows Total AVG
                                Time Mills/
                                Millis Row
./20000_true.xls   20000 4655 0.23
./24000_true.xls   24000 6398 0.27
./28800_true.xls   28800 7626 0.26
./34560_true.xls   34560 10178 0.29
./41472_true.xls   41472 17647 0.43
./49766_true.xls   49766 24645 0.5
./59719_true.xls   59719 38765 0.65
./71662_true.xls   71662 63009 0.88
./85994_true.xls   85994 98448 1.14
./103192_true.xls 103192 144313 1.4
./123830_true.xls 123830 217893 1.76
./148596_true.xls 148596 320606 2.16



Thanks for your effort's Yegor.
I will test this in the "wild" with real data soon.


Bryce Alcock * Performance Architect * SunGard *
Asset Arena * 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
Tel 630-986-3006 * Confidential Fax 630-515-1908 *
www.sungard.com/assetarena



-----Original Message-----
From: Yegor Kozlov [mailto:[hidden email]]
Sent: Mon 5/24/2010 12:58 AM
To: POI Developers List
Cc: Alcock, Bryce
Subject: Re: Performance Question with CTSheetDataImpl.java
 
the purpose of calling CTSheetData#setRowArray is to ensure that the
array of CTRow beans in CTSheetData is ordered. The point is that we
don't always need this operation, in many cases rows are already ordered
and re-assigning them to CTSheetData is an unnecessary expensive operation.

Compare two use cases:

case 1: rows are written in increasing order

XSSFSheet sheet = workbook.createSheet();
sheet.createRow(1);
sheet.createRow(2);
sheet.createRow(3);
workbook.write(out); //no need to re-order rows before writing


case 2: row are created in random order

XSSFSheet sheet = workbook.createSheet();
sheet.createRow(3);
sheet.createRow(2);
sheet.createRow(1);
workbook.write(out); //need to re-order rows before writing

In the first case the call of CTSheetData#setRowArray is extra because
the state of CTSheetData match the logical model. In the second case we
do need to call CTSheetData#setRowArray to ensure that rows are written
in ascending order.

So, the performance of XSSFSheet#write depends on how rows were added:
if rows are added in strict ascending order then XSSFSheet#write is
fast. If the order of rows is random or rows were shifted then
XSSFSheet#write involves CTSheetData#setRowArray and becomes more
expensive.

I committed this improvement in r947542. Please try the latest build
from trunk.

Below are two my tests, the only difference is that in the first test
rows are written in ascending order and in the second test in descending.

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = 0; i < Short.MAX_VALUE; i++) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out);
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }


generate: 5488 ms
write: 2507 ms

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = Short.MAX_VALUE; i > 0; i--) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out); //will involve CTSheetData#setRowArray
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }

generate: 9291 ms
write: 94888 ms

The first test is almost 40X faster!

Regards,
Yegor



> Continuing to chip away at CTSheetDataImpl's performance issues.
> However, the problematic code is unlikely to be changed, I have a discussion going on the XMLBeans website, and as it turns out that code is most likely not going to change, or will not change easily.
>
>
> With that in mind, I started taking a close look at the POI side of things and very specifically the problematic code from a POI perspective.
>
>
> At approximately line 2368 of XSSFSheet.java
>
> the following line is the suspect line:
> I have modified the code to break 1 line into 3  the slow line is the last one.
>
>          CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
>
>
>
> My Question is:  in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData (A CTSheetData object)
> that would allow us to either overwrite the default setRowArray(lctRow) method,
> or side step this call all together by calling a method that is not generic, but instead much
> more targeted to the function and process that is currently going on.
>
>
>
>
>
>
>
>
> Here is the full call in context...Take from my modified version of XSSFSheet.java
>
>
>      protected void write(OutputStream out) throws IOException {
>
>          if(worksheet.getColsArray().length == 1) {
>              CTCols col = worksheet.getColsArray(0);
>              CTCol[] cols = col.getColArray();
>              if(cols.length == 0) {
>                  worksheet.setColsArray(null);
>              }
>          }
>
>          // Now re-generate our CTHyperlinks, if needed
>          if(hyperlinks.size()>  0) {
>              if(worksheet.getHyperlinks() == null) {
>                  worksheet.addNewHyperlinks();
>              }
>              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>              for(int i=0; i<ctHls.length; i++) {
>                  // If our sheet has hyperlinks, have them add
>                  //  any relationships that they might need
>                  XSSFHyperlink hyperlink = hyperlinks.get(i);
>                  hyperlink.generateRelationIfNeeded(getPackagePart());
>                  // Now grab their underling object
>                  ctHls[i] = hyperlink.getCTHyperlink();
>              }
>              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>          }
>
>          CTSheetData sheetData = worksheet.getSheetData();
>          ArrayList<CTRow>  rArray = new ArrayList<CTRow>(rows.size());
>          for(XSSFRow row : rows.values()){
>              row.onDocumentWrite();
>              rArray.add(row.getCTRow());
>          }
> long startTimeSheetData = System.currentTimeMillis();
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
> System.out.println("Sheet Time : " + (System.currentTimeMillis() - startTimeSheetData));
>
>
>          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>          xmlOptions.setSaveSyntheticDocumentElement(new QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>          Map<String, String>  map = new HashMap<String, String>();
>          map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
>          xmlOptions.setSaveSuggestedPrefixes(map);
> long startTimeWorkSheetSave = System.currentTimeMillis();
>
>          worksheet.save(out, xmlOptions);
> System.out.println("Worksheet save Time : " + (System.currentTimeMillis() - startTimeWorkSheetSave));
>
>      }
>
>
>
> Best Regards
> Bryce Alcock
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>    






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

RE: Performance Question with CTSheetDataImpl.java

Bryce.Alcock
Couple of thoughts:

1.  I did a test with "real" data in a controlled environment, and
Identified that in the real world condition this change did help
significantly for large sheets that did not require re-ordering.  12 Min
processing time dropped to 4ish Minutes.  


2.  I have been giving it a bit more thought, and would like to through
the idea of "block" re-order comparison for a further improvements where
the data does not match.  The basic idea is that because the rows are
maintained in a linked list in xml beans, can we on the poi side of
things look at only comparing blocks of data and doing some of the list
manipulation as needed, but skip blocks where we know or can predict
that the underlying list is ordered correctly.

I am just floating this idea to the list.  I will not be able to dig
into it for a bit...
I would love to hear thoughts?


Bryce Alcock * Performance Architect * SunGard * Asset Arena *
377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
www.sungard.com/assetarena

Think before you print



-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Monday, May 24, 2010 10:06 AM
To: [hidden email]; [hidden email]
Subject: RE: Performance Question with CTSheetDataImpl.java

The performance improvement is significant for my simple "Laboratory"
test.
See attached PNG file showing Yegor's code vs. Poi before the change.


Yegor's Code Changes Inplace:
Note that the AVG Time per Row does not change (O(1) per row leads to
O(n) over all)

File Name       Rows Total AVG
                                Time Mills/
                                Millis Row
./20000_true.xls   20000 1964 0.1
./24000_true.xls   24000 2019 0.08
./28800_true.xls   28800 2624 0.09
./34560_true.xls   34560 3012 0.09
./41472_true.xls   41472 3409 0.08
./49766_true.xls   49766 4617 0.09
./59719_true.xls   59719 6162 0.1
./71662_true.xls   71662 5705 0.08
./85994_true.xls   85994 7487 0.09
./103192_true.xls 103192 9060 0.09
./123830_true.xls 123830 11875 0.1
./148596_true.xls 148596 13591 0.09


Original Poi (pre row order check)
Notice that the AVG Mills/Row increases linearly with rows
(O(n) which leads to O(n^2) as seen in the Total Time)

File Name       Rows Total AVG
                                Time Mills/
                                Millis Row
./20000_true.xls   20000 4655 0.23
./24000_true.xls   24000 6398 0.27
./28800_true.xls   28800 7626 0.26
./34560_true.xls   34560 10178 0.29
./41472_true.xls   41472 17647 0.43
./49766_true.xls   49766 24645 0.5
./59719_true.xls   59719 38765 0.65
./71662_true.xls   71662 63009 0.88
./85994_true.xls   85994 98448 1.14
./103192_true.xls 103192 144313 1.4
./123830_true.xls 123830 217893 1.76
./148596_true.xls 148596 320606 2.16



Thanks for your effort's Yegor.
I will test this in the "wild" with real data soon.


Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
Confidential Fax 630-515-1908 * www.sungard.com/assetarena



-----Original Message-----
From: Yegor Kozlov [mailto:[hidden email]]
Sent: Mon 5/24/2010 12:58 AM
To: POI Developers List
Cc: Alcock, Bryce
Subject: Re: Performance Question with CTSheetDataImpl.java
 
the purpose of calling CTSheetData#setRowArray is to ensure that the
array of CTRow beans in CTSheetData is ordered. The point is that we
don't always need this operation, in many cases rows are already ordered
and re-assigning them to CTSheetData is an unnecessary expensive
operation.

Compare two use cases:

case 1: rows are written in increasing order

XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
to re-order rows before writing


case 2: row are created in random order

XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
re-order rows before writing

In the first case the call of CTSheetData#setRowArray is extra because
the state of CTSheetData match the logical model. In the second case we
do need to call CTSheetData#setRowArray to ensure that rows are written
in ascending order.

So, the performance of XSSFSheet#write depends on how rows were added:
if rows are added in strict ascending order then XSSFSheet#write is
fast. If the order of rows is random or rows were shifted then
XSSFSheet#write involves CTSheetData#setRowArray and becomes more
expensive.

I committed this improvement in r947542. Please try the latest build
from trunk.

Below are two my tests, the only difference is that in the first test
rows are written in ascending order and in the second test in
descending.

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = 0; i < Short.MAX_VALUE; i++) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out);
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }


generate: 5488 ms
write: 2507 ms

     public static void main(String[] args) throws Exception {
         long t1 = System.currentTimeMillis();

         XSSFWorkbook wb = new XSSFWorkbook();
         XSSFSheet sh = wb.createSheet();
         for (int i = Short.MAX_VALUE; i > 0; i--) {
             XSSFRow row = sh.createRow(i);
             for (int j = 0; j < 5; j++) {
                 XSSFCell cell = row.createCell(j);
                 cell.setCellValue(i*j);
             }
         }
         long t2 = System.currentTimeMillis();
         System.out.println("generate: " + (t2-t1) + " ms");

         FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
         wb.write(out); //will involve CTSheetData#setRowArray
         out.close();
         long t3 = System.currentTimeMillis();
         System.out.println("write: " + (t3-t2) + " ms");
     }

generate: 9291 ms
write: 94888 ms

The first test is almost 40X faster!

Regards,
Yegor



> Continuing to chip away at CTSheetDataImpl's performance issues.
> However, the problematic code is unlikely to be changed, I have a
discussion going on the XMLBeans website, and as it turns out that code
is most likely not going to change, or will not change easily.
>
>
> With that in mind, I started taking a close look at the POI side of
things and very specifically the problematic code from a POI
perspective.
>
>
> At approximately line 2368 of XSSFSheet.java
>
> the following line is the suspect line:
> I have modified the code to break 1 line into 3  the slow line is the
last one.
>
>          CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
>
>
>
> My Question is:  in looking at XSSFSheet.java
> are there any assumption that we can make about the state of sheetData
(A CTSheetData object)
> that would allow us to either overwrite the default
setRowArray(lctRow) method,
> or side step this call all together by calling a method that is not
generic, but instead much

> more targeted to the function and process that is currently going on.
>
>
>
>
>
>
>
>
> Here is the full call in context...Take from my modified version of
XSSFSheet.java

>
>
>      protected void write(OutputStream out) throws IOException {
>
>          if(worksheet.getColsArray().length == 1) {
>              CTCols col = worksheet.getColsArray(0);
>              CTCol[] cols = col.getColArray();
>              if(cols.length == 0) {
>                  worksheet.setColsArray(null);
>              }
>          }
>
>          // Now re-generate our CTHyperlinks, if needed
>          if(hyperlinks.size()>  0) {
>              if(worksheet.getHyperlinks() == null) {
>                  worksheet.addNewHyperlinks();
>              }
>              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>              for(int i=0; i<ctHls.length; i++) {
>                  // If our sheet has hyperlinks, have them add
>                  //  any relationships that they might need
>                  XSSFHyperlink hyperlink = hyperlinks.get(i);
>                  hyperlink.generateRelationIfNeeded(getPackagePart());
>                  // Now grab their underling object
>                  ctHls[i] = hyperlink.getCTHyperlink();
>              }
>              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>          }
>
>          CTSheetData sheetData = worksheet.getSheetData();
>          ArrayList<CTRow>  rArray = new ArrayList<CTRow>(rows.size());
>          for(XSSFRow row : rows.values()){
>              row.onDocumentWrite();
>              rArray.add(row.getCTRow());
>          }
> long startTimeSheetData = System.currentTimeMillis();
> CTRow[] lctRow = new CTRow[rArray.size()];
> lctRow = rArray.toArray(lctRow);
>          sheetData.setRowArray(lctRow);
> System.out.println("Sheet Time : " +
(System.currentTimeMillis() - startTimeSheetData));
>
>
>          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>          xmlOptions.setSaveSyntheticDocumentElement(new
QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>          Map<String, String>  map = new HashMap<String, String>();
>          map.put(STRelationshipId.type.getName().getNamespaceURI(),
"r");
>          xmlOptions.setSaveSuggestedPrefixes(map);
> long startTimeWorkSheetSave =
System.currentTimeMillis();
>
>          worksheet.save(out, xmlOptions);
> System.out.println("Worksheet save Time : " +
(System.currentTimeMillis() - startTimeWorkSheetSave));

>
>      }
>
>
>
> Best Regards
> Bryce Alcock
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>    







---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

Yegor Kozlov
On 5/27/10 12:49 AM, [hidden email] wrote:
> Couple of thoughts:
>
> 1.  I did a test with "real" data in a controlled environment, and
> Identified that in the real world condition this change did help
> significantly for large sheets that did not require re-ordering.  12 Min
> processing time dropped to 4ish Minutes.
>
>    

I made a similar improvement in XSSFRow#onDocumentWrite(), this method
always re-assigned the array of CTCell even if the cells were ordered.
The performance boost is not so big as we got from the optimization of
XSSFSheet#write. The number of cells is usually small and the call of
CTRow#setCArray is not that expensive, the improvement should be
significant with 'wide' sheets having more than 100 columns.

The fix was committed in r949375

> 2.  I have been giving it a bit more thought, and would like to through
> the idea of "block" re-order comparison for a further improvements where
> the data does not match.  The basic idea is that because the rows are
> maintained in a linked list in xml beans, can we on the poi side of
> things look at only comparing blocks of data and doing some of the list
> manipulation as needed, but skip blocks where we know or can predict
> that the underlying list is ordered correctly.
>
> I am just floating this idea to the list.  I will not be able to dig
> into it for a bit...
> I would love to hear thoughts?
>
>    
I don't quite understand. Are you suggesting to manipulate low-level DOM
objects and swap only those pieces that are not sorted?
In theory you can do that using XmlCursor but the code will be tricky
and harder to maintain. Also, the performance gain is not evident to me.

An alternative approach is to improve XSSFSheet#createRow so that the
array of CTRow beans is kept in a sorted state. Current implementation
appends new rows, this is why we need the trick with re-ordering to
ensure that the resulting XML is valid. If we do this optimization then
the re-ordering trick is extra and can be removed. The performance of
XSSFSheet#createRow can be slower (random insertions in LinkedList!) but
the performance of XSSFSheet#write will be definitely faster.

Yegor

> Bryce Alcock * Performance Architect * SunGard * Asset Arena *
> 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
> www.sungard.com/assetarena
>
> Think before you print
>
>
>
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> Sent: Monday, May 24, 2010 10:06 AM
> To: [hidden email]; [hidden email]
> Subject: RE: Performance Question with CTSheetDataImpl.java
>
> The performance improvement is significant for my simple "Laboratory"
> test.
> See attached PNG file showing Yegor's code vs. Poi before the change.
>
>
> Yegor's Code Changes Inplace:
> Note that the AVG Time per Row does not change (O(1) per row leads to
> O(n) over all)
>
> File Name       Rows Total AVG
> Time Mills/
> Millis Row
> ./20000_true.xls   20000 1964 0.1
> ./24000_true.xls   24000 2019 0.08
> ./28800_true.xls   28800 2624 0.09
> ./34560_true.xls   34560 3012 0.09
> ./41472_true.xls   41472 3409 0.08
> ./49766_true.xls   49766 4617 0.09
> ./59719_true.xls   59719 6162 0.1
> ./71662_true.xls   71662 5705 0.08
> ./85994_true.xls   85994 7487 0.09
> ./103192_true.xls 103192 9060 0.09
> ./123830_true.xls 123830 11875 0.1
> ./148596_true.xls 148596 13591 0.09
>
>
> Original Poi (pre row order check)
> Notice that the AVG Mills/Row increases linearly with rows
> (O(n) which leads to O(n^2) as seen in the Total Time)
>
> File Name       Rows Total AVG
> Time Mills/
> Millis Row
> ./20000_true.xls   20000 4655 0.23
> ./24000_true.xls   24000 6398 0.27
> ./28800_true.xls   28800 7626 0.26
> ./34560_true.xls   34560 10178 0.29
> ./41472_true.xls   41472 17647 0.43
> ./49766_true.xls   49766 24645 0.5
> ./59719_true.xls   59719 38765 0.65
> ./71662_true.xls   71662 63009 0.88
> ./85994_true.xls   85994 98448 1.14
> ./103192_true.xls 103192 144313 1.4
> ./123830_true.xls 123830 217893 1.76
> ./148596_true.xls 148596 320606 2.16
>
>
>
> Thanks for your effort's Yegor.
> I will test this in the "wild" with real data soon.
>
>
> Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
> Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
> Confidential Fax 630-515-1908 * www.sungard.com/assetarena
>
>
>
> -----Original Message-----
> From: Yegor Kozlov [mailto:[hidden email]]
> Sent: Mon 5/24/2010 12:58 AM
> To: POI Developers List
> Cc: Alcock, Bryce
> Subject: Re: Performance Question with CTSheetDataImpl.java
>
> the purpose of calling CTSheetData#setRowArray is to ensure that the
> array of CTRow beans in CTSheetData is ordered. The point is that we
> don't always need this operation, in many cases rows are already ordered
> and re-assigning them to CTSheetData is an unnecessary expensive
> operation.
>
> Compare two use cases:
>
> case 1: rows are written in increasing order
>
> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
> sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
> to re-order rows before writing
>
>
> case 2: row are created in random order
>
> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
> sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
> re-order rows before writing
>
> In the first case the call of CTSheetData#setRowArray is extra because
> the state of CTSheetData match the logical model. In the second case we
> do need to call CTSheetData#setRowArray to ensure that rows are written
> in ascending order.
>
> So, the performance of XSSFSheet#write depends on how rows were added:
> if rows are added in strict ascending order then XSSFSheet#write is
> fast. If the order of rows is random or rows were shifted then
> XSSFSheet#write involves CTSheetData#setRowArray and becomes more
> expensive.
>
> I committed this improvement in r947542. Please try the latest build
> from trunk.
>
> Below are two my tests, the only difference is that in the first test
> rows are written in ascending order and in the second test in
> descending.
>
>       public static void main(String[] args) throws Exception {
>           long t1 = System.currentTimeMillis();
>
>           XSSFWorkbook wb = new XSSFWorkbook();
>           XSSFSheet sh = wb.createSheet();
>           for (int i = 0; i<  Short.MAX_VALUE; i++) {
>               XSSFRow row = sh.createRow(i);
>               for (int j = 0; j<  5; j++) {
>                   XSSFCell cell = row.createCell(j);
>                   cell.setCellValue(i*j);
>               }
>           }
>           long t2 = System.currentTimeMillis();
>           System.out.println("generate: " + (t2-t1) + " ms");
>
>           FileOutputStream out = new FileOutputStream(new
> File("/temp/rows.xlsx"));
>           wb.write(out);
>           out.close();
>           long t3 = System.currentTimeMillis();
>           System.out.println("write: " + (t3-t2) + " ms");
>       }
>
>
> generate: 5488 ms
> write: 2507 ms
>
>       public static void main(String[] args) throws Exception {
>           long t1 = System.currentTimeMillis();
>
>           XSSFWorkbook wb = new XSSFWorkbook();
>           XSSFSheet sh = wb.createSheet();
>           for (int i = Short.MAX_VALUE; i>  0; i--) {
>               XSSFRow row = sh.createRow(i);
>               for (int j = 0; j<  5; j++) {
>                   XSSFCell cell = row.createCell(j);
>                   cell.setCellValue(i*j);
>               }
>           }
>           long t2 = System.currentTimeMillis();
>           System.out.println("generate: " + (t2-t1) + " ms");
>
>           FileOutputStream out = new FileOutputStream(new
> File("/temp/rows.xlsx"));
>           wb.write(out); //will involve CTSheetData#setRowArray
>           out.close();
>           long t3 = System.currentTimeMillis();
>           System.out.println("write: " + (t3-t2) + " ms");
>       }
>
> generate: 9291 ms
> write: 94888 ms
>
> The first test is almost 40X faster!
>
> Regards,
> Yegor
>
>
>
>    
>> Continuing to chip away at CTSheetDataImpl's performance issues.
>> However, the problematic code is unlikely to be changed, I have a
>>      
> discussion going on the XMLBeans website, and as it turns out that code
> is most likely not going to change, or will not change easily.
>    
>>
>> With that in mind, I started taking a close look at the POI side of
>>      
> things and very specifically the problematic code from a POI
> perspective.
>    
>>
>> At approximately line 2368 of XSSFSheet.java
>>
>> the following line is the suspect line:
>> I have modified the code to break 1 line into 3  the slow line is the
>>      
> last one.
>    
>>           CTRow[] lctRow = new CTRow[rArray.size()];
>> lctRow = rArray.toArray(lctRow);
>>           sheetData.setRowArray(lctRow);
>>
>>
>>
>> My Question is:  in looking at XSSFSheet.java
>> are there any assumption that we can make about the state of sheetData
>>      
> (A CTSheetData object)
>    
>> that would allow us to either overwrite the default
>>      
> setRowArray(lctRow) method,
>    
>> or side step this call all together by calling a method that is not
>>      
> generic, but instead much
>    
>> more targeted to the function and process that is currently going on.
>>
>>
>>
>>
>>
>>
>>
>>
>> Here is the full call in context...Take from my modified version of
>>      
> XSSFSheet.java
>    
>>
>>       protected void write(OutputStream out) throws IOException {
>>
>>           if(worksheet.getColsArray().length == 1) {
>>               CTCols col = worksheet.getColsArray(0);
>>               CTCol[] cols = col.getColArray();
>>               if(cols.length == 0) {
>>                   worksheet.setColsArray(null);
>>               }
>>           }
>>
>>           // Now re-generate our CTHyperlinks, if needed
>>           if(hyperlinks.size()>   0) {
>>               if(worksheet.getHyperlinks() == null) {
>>                   worksheet.addNewHyperlinks();
>>               }
>>               CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>>               for(int i=0; i<ctHls.length; i++) {
>>                   // If our sheet has hyperlinks, have them add
>>                   //  any relationships that they might need
>>                   XSSFHyperlink hyperlink = hyperlinks.get(i);
>>                   hyperlink.generateRelationIfNeeded(getPackagePart());
>>                   // Now grab their underling object
>>                   ctHls[i] = hyperlink.getCTHyperlink();
>>               }
>>               worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>>           }
>>
>>           CTSheetData sheetData = worksheet.getSheetData();
>>           ArrayList<CTRow>   rArray = new ArrayList<CTRow>(rows.size());
>>           for(XSSFRow row : rows.values()){
>>               row.onDocumentWrite();
>>               rArray.add(row.getCTRow());
>>           }
>> long startTimeSheetData = System.currentTimeMillis();
>> CTRow[] lctRow = new CTRow[rArray.size()];
>> lctRow = rArray.toArray(lctRow);
>>           sheetData.setRowArray(lctRow);
>> System.out.println("Sheet Time : " +
>>      
> (System.currentTimeMillis() - startTimeSheetData));
>    
>>
>>           XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>>           xmlOptions.setSaveSyntheticDocumentElement(new
>>      
> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>    
>>           Map<String, String>   map = new HashMap<String, String>();
>>           map.put(STRelationshipId.type.getName().getNamespaceURI(),
>>      
> "r");
>    
>>           xmlOptions.setSaveSuggestedPrefixes(map);
>> long startTimeWorkSheetSave =
>>      
> System.currentTimeMillis();
>    
>>           worksheet.save(out, xmlOptions);
>> System.out.println("Worksheet save Time : " +
>>      
> (System.currentTimeMillis() - startTimeWorkSheetSave));
>    
>>       }
>>
>>
>>
>> Best Regards
>> Bryce Alcock
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>>      
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance Question with CTSheetDataImpl.java

David Fisher

On May 29, 2010, at 4:21 AM, Yegor Kozlov wrote:

> On 5/27/10 12:49 AM, [hidden email] wrote:
>> Couple of thoughts:
>>
>> 1.  I did a test with "real" data in a controlled environment, and
>> Identified that in the real world condition this change did help
>> significantly for large sheets that did not require re-ordering.  12 Min
>> processing time dropped to 4ish Minutes.
>>
>>  
>
> I made a similar improvement in XSSFRow#onDocumentWrite(), this method always re-assigned the array of CTCell even if the cells were ordered.
> The performance boost is not so big as we got from the optimization of XSSFSheet#write. The number of cells is usually small and the call of CTRow#setCArray is not that expensive, the improvement should be significant with 'wide' sheets having more than 100 columns.
>
> The fix was committed in r949375
>> 2.  I have been giving it a bit more thought, and would like to through
>> the idea of "block" re-order comparison for a further improvements where
>> the data does not match.  The basic idea is that because the rows are
>> maintained in a linked list in xml beans, can we on the poi side of
>> things look at only comparing blocks of data and doing some of the list
>> manipulation as needed, but skip blocks where we know or can predict
>> that the underlying list is ordered correctly.
>>
>> I am just floating this idea to the list.  I will not be able to dig
>> into it for a bit...
>> I would love to hear thoughts?
>>
>>  
> I don't quite understand. Are you suggesting to manipulate low-level DOM objects and swap only those pieces that are not sorted?
> In theory you can do that using XmlCursor but the code will be tricky and harder to maintain. Also, the performance gain is not evident to me.
>
> An alternative approach is to improve XSSFSheet#createRow so that the array of CTRow beans is kept in a sorted state. Current implementation appends new rows, this is why we need the trick with re-ordering to ensure that the resulting XML is valid. If we do this optimization then the re-ordering trick is extra and can be removed. The performance of XSSFSheet#createRow can be slower (random insertions in LinkedList!) but the performance of XSSFSheet#write will be definitely faster.

Why a Single LinkedList? If you do use one a linked list - then try a Double-Linked and insert from the end. This will make in order insertions O(1) - instead of O(n).

Regards,
Dave




>
> Yegor
>> Bryce Alcock * Performance Architect * SunGard * Asset Arena *
>> 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
>> www.sungard.com/assetarena
>>
>> Think before you print
>>
>>
>>
>> -----Original Message-----
>> From: [hidden email] [mailto:[hidden email]]
>> Sent: Monday, May 24, 2010 10:06 AM
>> To: [hidden email]; [hidden email]
>> Subject: RE: Performance Question with CTSheetDataImpl.java
>>
>> The performance improvement is significant for my simple "Laboratory"
>> test.
>> See attached PNG file showing Yegor's code vs. Poi before the change.
>>
>>
>> Yegor's Code Changes Inplace:
>> Note that the AVG Time per Row does not change (O(1) per row leads to
>> O(n) over all)
>>
>> File Name       Rows Total AVG
>> Time Mills/
>> Millis Row
>> ./20000_true.xls   20000 1964 0.1
>> ./24000_true.xls   24000 2019 0.08
>> ./28800_true.xls   28800 2624 0.09
>> ./34560_true.xls   34560 3012 0.09
>> ./41472_true.xls   41472 3409 0.08
>> ./49766_true.xls   49766 4617 0.09
>> ./59719_true.xls   59719 6162 0.1
>> ./71662_true.xls   71662 5705 0.08
>> ./85994_true.xls   85994 7487 0.09
>> ./103192_true.xls 103192 9060 0.09
>> ./123830_true.xls 123830 11875 0.1
>> ./148596_true.xls 148596 13591 0.09
>>
>>
>> Original Poi (pre row order check)
>> Notice that the AVG Mills/Row increases linearly with rows
>> (O(n) which leads to O(n^2) as seen in the Total Time)
>>
>> File Name       Rows Total AVG
>> Time Mills/
>> Millis Row
>> ./20000_true.xls   20000 4655 0.23
>> ./24000_true.xls   24000 6398 0.27
>> ./28800_true.xls   28800 7626 0.26
>> ./34560_true.xls   34560 10178 0.29
>> ./41472_true.xls   41472 17647 0.43
>> ./49766_true.xls   49766 24645 0.5
>> ./59719_true.xls   59719 38765 0.65
>> ./71662_true.xls   71662 63009 0.88
>> ./85994_true.xls   85994 98448 1.14
>> ./103192_true.xls 103192 144313 1.4
>> ./123830_true.xls 123830 217893 1.76
>> ./148596_true.xls 148596 320606 2.16
>>
>>
>>
>> Thanks for your effort's Yegor.
>> I will test this in the "wild" with real data soon.
>>
>>
>> Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
>> Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
>> Confidential Fax 630-515-1908 * www.sungard.com/assetarena
>>
>>
>>
>> -----Original Message-----
>> From: Yegor Kozlov [mailto:[hidden email]]
>> Sent: Mon 5/24/2010 12:58 AM
>> To: POI Developers List
>> Cc: Alcock, Bryce
>> Subject: Re: Performance Question with CTSheetDataImpl.java
>>
>> the purpose of calling CTSheetData#setRowArray is to ensure that the
>> array of CTRow beans in CTSheetData is ordered. The point is that we
>> don't always need this operation, in many cases rows are already ordered
>> and re-assigning them to CTSheetData is an unnecessary expensive
>> operation.
>>
>> Compare two use cases:
>>
>> case 1: rows are written in increasing order
>>
>> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
>> sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
>> to re-order rows before writing
>>
>>
>> case 2: row are created in random order
>>
>> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
>> sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
>> re-order rows before writing
>>
>> In the first case the call of CTSheetData#setRowArray is extra because
>> the state of CTSheetData match the logical model. In the second case we
>> do need to call CTSheetData#setRowArray to ensure that rows are written
>> in ascending order.
>>
>> So, the performance of XSSFSheet#write depends on how rows were added:
>> if rows are added in strict ascending order then XSSFSheet#write is
>> fast. If the order of rows is random or rows were shifted then
>> XSSFSheet#write involves CTSheetData#setRowArray and becomes more
>> expensive.
>>
>> I committed this improvement in r947542. Please try the latest build
>> from trunk.
>>
>> Below are two my tests, the only difference is that in the first test
>> rows are written in ascending order and in the second test in
>> descending.
>>
>>      public static void main(String[] args) throws Exception {
>>          long t1 = System.currentTimeMillis();
>>
>>          XSSFWorkbook wb = new XSSFWorkbook();
>>          XSSFSheet sh = wb.createSheet();
>>          for (int i = 0; i<  Short.MAX_VALUE; i++) {
>>              XSSFRow row = sh.createRow(i);
>>              for (int j = 0; j<  5; j++) {
>>                  XSSFCell cell = row.createCell(j);
>>                  cell.setCellValue(i*j);
>>              }
>>          }
>>          long t2 = System.currentTimeMillis();
>>          System.out.println("generate: " + (t2-t1) + " ms");
>>
>>          FileOutputStream out = new FileOutputStream(new
>> File("/temp/rows.xlsx"));
>>          wb.write(out);
>>          out.close();
>>          long t3 = System.currentTimeMillis();
>>          System.out.println("write: " + (t3-t2) + " ms");
>>      }
>>
>>
>> generate: 5488 ms
>> write: 2507 ms
>>
>>      public static void main(String[] args) throws Exception {
>>          long t1 = System.currentTimeMillis();
>>
>>          XSSFWorkbook wb = new XSSFWorkbook();
>>          XSSFSheet sh = wb.createSheet();
>>          for (int i = Short.MAX_VALUE; i>  0; i--) {
>>              XSSFRow row = sh.createRow(i);
>>              for (int j = 0; j<  5; j++) {
>>                  XSSFCell cell = row.createCell(j);
>>                  cell.setCellValue(i*j);
>>              }
>>          }
>>          long t2 = System.currentTimeMillis();
>>          System.out.println("generate: " + (t2-t1) + " ms");
>>
>>          FileOutputStream out = new FileOutputStream(new
>> File("/temp/rows.xlsx"));
>>          wb.write(out); //will involve CTSheetData#setRowArray
>>          out.close();
>>          long t3 = System.currentTimeMillis();
>>          System.out.println("write: " + (t3-t2) + " ms");
>>      }
>>
>> generate: 9291 ms
>> write: 94888 ms
>>
>> The first test is almost 40X faster!
>>
>> Regards,
>> Yegor
>>
>>
>>
>>  
>>> Continuing to chip away at CTSheetDataImpl's performance issues.
>>> However, the problematic code is unlikely to be changed, I have a
>>>    
>> discussion going on the XMLBeans website, and as it turns out that code
>> is most likely not going to change, or will not change easily.
>>  
>>>
>>> With that in mind, I started taking a close look at the POI side of
>>>    
>> things and very specifically the problematic code from a POI
>> perspective.
>>  
>>>
>>> At approximately line 2368 of XSSFSheet.java
>>>
>>> the following line is the suspect line:
>>> I have modified the code to break 1 line into 3  the slow line is the
>>>    
>> last one.
>>  
>>>          CTRow[] lctRow = new CTRow[rArray.size()];
>>> lctRow = rArray.toArray(lctRow);
>>>          sheetData.setRowArray(lctRow);
>>>
>>>
>>>
>>> My Question is:  in looking at XSSFSheet.java
>>> are there any assumption that we can make about the state of sheetData
>>>    
>> (A CTSheetData object)
>>  
>>> that would allow us to either overwrite the default
>>>    
>> setRowArray(lctRow) method,
>>  
>>> or side step this call all together by calling a method that is not
>>>    
>> generic, but instead much
>>  
>>> more targeted to the function and process that is currently going on.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Here is the full call in context...Take from my modified version of
>>>    
>> XSSFSheet.java
>>  
>>>
>>>      protected void write(OutputStream out) throws IOException {
>>>
>>>          if(worksheet.getColsArray().length == 1) {
>>>              CTCols col = worksheet.getColsArray(0);
>>>              CTCol[] cols = col.getColArray();
>>>              if(cols.length == 0) {
>>>                  worksheet.setColsArray(null);
>>>              }
>>>          }
>>>
>>>          // Now re-generate our CTHyperlinks, if needed
>>>          if(hyperlinks.size()>   0) {
>>>              if(worksheet.getHyperlinks() == null) {
>>>                  worksheet.addNewHyperlinks();
>>>              }
>>>              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>>>              for(int i=0; i<ctHls.length; i++) {
>>>                  // If our sheet has hyperlinks, have them add
>>>                  //  any relationships that they might need
>>>                  XSSFHyperlink hyperlink = hyperlinks.get(i);
>>>                  hyperlink.generateRelationIfNeeded(getPackagePart());
>>>                  // Now grab their underling object
>>>                  ctHls[i] = hyperlink.getCTHyperlink();
>>>              }
>>>              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>>>          }
>>>
>>>          CTSheetData sheetData = worksheet.getSheetData();
>>>          ArrayList<CTRow>   rArray = new ArrayList<CTRow>(rows.size());
>>>          for(XSSFRow row : rows.values()){
>>>              row.onDocumentWrite();
>>>              rArray.add(row.getCTRow());
>>>          }
>>> long startTimeSheetData = System.currentTimeMillis();
>>> CTRow[] lctRow = new CTRow[rArray.size()];
>>> lctRow = rArray.toArray(lctRow);
>>>          sheetData.setRowArray(lctRow);
>>> System.out.println("Sheet Time : " +
>>>    
>> (System.currentTimeMillis() - startTimeSheetData));
>>  
>>>
>>>          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>>>          xmlOptions.setSaveSyntheticDocumentElement(new
>>>    
>> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>>  
>>>          Map<String, String>   map = new HashMap<String, String>();
>>>          map.put(STRelationshipId.type.getName().getNamespaceURI(),
>>>    
>> "r");
>>  
>>>          xmlOptions.setSaveSuggestedPrefixes(map);
>>> long startTimeWorkSheetSave =
>>>    
>> System.currentTimeMillis();
>>  
>>>          worksheet.save(out, xmlOptions);
>>> System.out.println("Worksheet save Time : " +
>>>    
>> (System.currentTimeMillis() - startTimeWorkSheetSave));
>>  
>>>      }
>>>
>>>
>>>
>>> Best Regards
>>> Bryce Alcock
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>>
>>>
>>>    
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>  
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12