Figure out which cells are merged in a streaming fashion

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Figure out which cells are merged in a streaming fashion

Tomasz Melcer
Hi,

I am trying to parse large (many sheets with hundreds of thousands of rows and tens of columns) spreadsheet files, xlsx and xlsb, in a streaming fashion. The spreadsheet is a standard table with a two-row header, where the upper row are merged cells representing groups of columns, and the lower row contains names of columns within a group. I need to identify the columns I need to parse before reading the data, and for that I need to know which group a given column belongs to.

However, I cannot find information on how to recognize merged cells in a streaming fashion. Is it even possible? As far as I know, in case of xlsx files, information on which cells are in a merged region is stored in the <mergeCells> element of the sheet xml, and in my example files this element is at the end of the file.

Thank you,

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Figure out which cells are merged in a streaming fashion

Tomasz Melcer
On 2020/03/20 18:26:04, Tomasz Melcer <[hidden email]> wrote:
> However, I cannot find information on how to recognize merged cells in a streaming fashion. Is it even possible? As far as I know, in case of xlsx files, information on which cells are in a merged region is stored in the <mergeCells> element of the sheet xml, and in my example files this element is at the end of the file.

I ended up writing my own parser that using XSSFReader seeks the following template:

<mergeCells count="2">
  <mergeCell ref="A1:B2" />
  <mergeCell ref="C3:D4" />
</mergeCells>

I do two runs over each sheet: first to detect merged cells, second to actually parse content. So far it works, and the first run is surprisingly fast.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]