POI memory usage for 4MB Excel file is high

Mahmood Naderan
Hi,

I used Apache POI to read an Excel file. The file has 16000 rows and 50 columns. Each cell is read as a string and stored in a 2D array (i.e. each row is a string array of size 50, and the 16000 rows are put in another array).

The code looks like this:

try (FileInputStream fIP = new FileInputStream(selectedFile)) {
  XSSFWorkbook wb = new XSSFWorkbook(fIP);
  XSSFSheet ws = wb.getSheetAt(0);
  maxRows = ws.getLastRowNum() + 1;
  maxColumns = ws.getRow(0).getLastCellNum();
  theRows = new OneRow[maxRows];
  Iterator< Row > rowIt = ws.iterator();
  XSSFRow row;
  int i = 0;
  long chs = 0;
  while ( rowIt.hasNext() ) {
    row = (XSSFRow) rowIt.next();
    Iterator< Cell > cellIt = row.cellIterator();
    String [] str = new String[maxColumns];
    int j = 0;
    long ch = 0;
    while ( cellIt.hasNext() ) {
      Cell cell = cellIt.next();
      str[ j ] = cell.getStringCellValue();
      ch += str[ j ].length(); // sum one row's length
      System.out.print( str[ j ].length() + " " );
      ++j;
    }
    System.out.print( "\n  ->" + ch + "\n" );
    chs += ch;   // sum total characters length
    theRows[ i ] = new OneRow( maxColumns );
    theRows[ i ].add( str );
    ++i;
  }
  System.out.print( "\n  ==>" + chs + "\n" );
}



The program also counts the length of each cell in ch, and the running total over all rows is updated with chs += ch, so chs is the total number of characters. The value is 3,230,798 (about 3M chars), but the used heap is about 1GB. That means each character is consuming roughly 330 bytes!
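
For reference, the arithmetic behind that figure can be checked with a short sketch. It assumes that "about 1GB" means 2^30 bytes, and the per-String size is only a rough estimate for a 64-bit HotSpot JVM with compressed references on Java 8 (a 24-byte String object plus a char[] with a 16-byte header). Both are assumptions, not measurements from this thread, and the class and method names are made up for illustration.

```java
final class HeapPerCharEstimate {

    /** Average number of heap bytes per character of cell text. */
    static double bytesPerChar(long usedHeapBytes, long totalChars) {
        return (double) usedHeapBytes / totalChars;
    }

    /** Rounds n up to the next multiple of 8 (typical JVM object alignment). */
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    /**
     * Rough heap size of one String of the given length.
     * ASSUMPTION: Java 8 layout on 64-bit HotSpot with compressed references.
     */
    static long approxStringBytes(int length) {
        long stringObject = 24;                    // object header + array reference + cached hash, padded
        long charArray = align8(16 + 2L * length); // array header + 2 bytes per char, padded
        return stringObject + charArray;
    }

    public static void main(String[] args) {
        long usedHeap = 1L << 30;        // "about 1GB", taken as 2^30 bytes (assumption)
        long totalChars = 3_230_798L;    // value of chs reported above
        long cells = 16_000L * 50;       // rows x columns from the post

        System.out.printf("bytes per char: %.1f%n", bytesPerChar(usedHeap, totalChars));

        int avgLen = (int) Math.round((double) totalChars / cells);
        System.out.printf("estimated bytes for the strings alone: %d%n",
                cells * approxStringBytes(avgLen));
    }
}
```

Under these assumptions the 800,000 cell strings average about 4 characters and would need roughly 48 bytes each, which is tens of megabytes in total. If that estimate is anywhere near right, most of the 1GB is held somewhere other than the 2D array.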



Using the profiler, I see that it needs a lot of heap, and I have to increase the heap size to 2048 MB. Also, much of the CPU time is devoted to GC, which is undesirable.



I have to say that XSSFSheet ws = wb.getSheetAt(0); is the most time- and memory-consuming part. To investigate that, I manually counted the data structure sizes with ch and chs.
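
A complementary way to see which step allocates the memory, without a profiler, is to sample the used heap before and after that step. The following is a minimal sketch using only java.lang.Runtime; HeapProbe and Measurement are hypothetical names, and the numbers it yields are approximate because System.gc() is only a hint and the collector may run at any time.

```java
import java.util.function.Supplier;

final class HeapProbe {

    /** Result of a measured step: the value it produced and the observed heap growth. */
    static final class Measurement<T> {
        final T result;        // kept so the allocated objects stay reachable
        final long deltaBytes; // may be negative if a GC ran during the step

        Measurement(T result, long deltaBytes) {
            this.result = result;
            this.deltaBytes = deltaBytes;
        }
    }

    /** Currently used heap in bytes, as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs one step and reports how much the used heap changed across it. */
    static <T> Measurement<T> measure(Supplier<T> step) {
        System.gc(); // request a collection first; the JVM is free to ignore this
        long before = usedHeapBytes();
        T result = step.get();
        long after = usedHeapBytes();
        return new Measurement<>(result, after - before);
    }

    public static void main(String[] args) {
        // Stand-in workload: allocate a 16000 x 50 array of empty slots.
        Measurement<String[][]> m = measure(() -> new String[16_000][50]);
        System.out.println("rows: " + m.result.length + ", heap delta (bytes): " + m.deltaBytes);
    }
}
```

Wrapping the workbook construction and the getSheetAt call in separate measure(...) calls would show which of the two accounts for the growth.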




Please see the profiler output at https://i.stack.imgur.com/ZfLHj.jpg

In the memory chart, the first peak from the left is where it finishes loading the Excel file.



 Regards,
Mahmood

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: POI memory usage for 4MB Excel file is high

Andreas Reichel
Mahmood,

this has been discussed many times already: XSSFWorkbook is a memory hog because it holds the whole workbook in a DOM tree with a lot of strings. (Does Java 8 String Deduplication help with that, when you have enough cores?)
For cases like yours, reading the file as a stream will be more suitable. Note that SXSSFWorkbook only streams when writing; for reading, POI provides the XSSF event API (XSSFReader with SAX). You might also have a look at https://github.com/monitorjbl/excel-streaming-reader
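
The difference between holding a tree and reading a stream can be illustrated with JDK classes alone. The sketch below is not POI's API and not the excel-streaming-reader API. It runs a plain SAX parser over a tiny hand-written fragment that imitates sheet XML with inline strings, and sums the text lengths as they stream past, so only one event's worth of data is in memory at a time. The class name, the sample XML, and the assumption that cell text sits in <t> elements inside the sheet part (real files usually keep most strings in a separate shared-strings part) are all illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class StreamingCellTextCounter {

    /** Sums the length of all <t> text nodes without building a tree of the document. */
    static long countChars(InputStream sheetXml) {
        final long[] total = {0};
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText; // true while the parser is inside a <t> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("t".equals(qName)) inText = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("t".equals(qName)) inText = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) total[0] += length; // may be called several times per text node
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
        return total[0];
    }

    /** Hand-written sample: one row with the inline strings "abc" and "de". */
    static InputStream sampleSheetXml() {
        String xml = "<worksheet><sheetData><row r=\"1\">"
                + "<c r=\"A1\" t=\"inlineStr\"><is><t>abc</t></is></c>"
                + "<c r=\"B1\" t=\"inlineStr\"><is><t>de</t></is></c>"
                + "</row></sheetData></worksheet>";
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(countChars(sampleSheetXml())); // prints 5
    }
}
```

A streaming reader for real .xlsx files follows the same idea, with the added work of unzipping the package and resolving shared strings, which is what the libraries mentioned above take care of.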

Best regards
Andreas

On Wed, 2017-05-03 at 15:40 +0000, Mahmood Naderan wrote:
[snip]

Re: POI memory usage for 4MB Excel file is high

Mahmood Naderan-2
Thanks. I am new to this and have not used Maven yet. I have downloaded the jar file
https://github.com/monitorjbl/excel-streaming-reader/releases/download/1.1.0/xlsx-streamer-1.1.0.jar

and put it in the project folder. The instructions say to add


<dependencies>
  <dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>1.0.1</version>
  </dependency>
</dependencies>

to the pom.xml. I see one pom.xml in the project folder which looks like

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.scu</groupId>
  <artifactId>testmaven</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
</project>


Where should I add those lines? Inside the properties tag?
Regards,
Mahmood





On Wednesday, May 3, 2017 8:34 PM, Andreas Reichel <[hidden email]> wrote:



[snip]