Perculiar problem reading validations from a XLSX exported from Google Sheets

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Perculiar problem reading validations from a XLSX exported from Google Sheets

Stuart Owen-3
Hi,

I've encountered an odd problem reading validations from an xlsx
spreadsheet uploaded to Google Drive, and then exported.
To cut a long story short, my particular use-case involves a pipeline of
creating an Excel spreadsheet, which is then shared and used via Google
Sheets to allow scientists to collaborate and annotate biological data
(using data validations), and is then exported and the annotations read
via Apache POI.

I have found that Apache POI can detect the validations in the original
spreadsheet, but not in the exported version. The validations are read
via XSSFSheet.getDataValidations(). The validations however, can be read
fine when opening in either Excel or LibreOffice, and if resaved can
then again be read by Apache POI.

To clarify things, I've put together a test case to demonstrate the
problem at https://github.com/stuzart/poi-google-export-testcase

I'm not sure if this is a problem with Google export, or Apache POI.

thanks,

Stuart

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Andreas Reichel
Stuart,

maybe you could provide the simpliest possible test case, e. g.
creating your form with only 2 columns and 1 row, add the validation
and then:

1) extract the XML after your created the template
2) extract the XML after you have exporting it from Google spreadsheets
3) extract the XML after importing and re-exporting it from Libreoffice

Then compare the XML files and see what is actually different.
Also I do not understand yet, why do you want to read the validation in
POI again? In my limited understanding, the Google Spreadsheet will
validate the user inputs already and assure, you will read a valid
file?

Possible work-arounds in the meantime:

1) when opening the Google Spreadsheet file, read the Validations from
the original XLSX file in parallel and merge that information
(e.g. take the data from the Google Spreadsheet, but the structure and
definitions from the original XLSX).

2) alternatively, run it automated through Libreoffice first (it has a
command-line/shell only option)

Best regards
Andreas

On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:

> Hi,
> I've encountered an odd problem reading validations from an xlsx
> spreadsheet uploaded to Google Drive, and then exported.To cut a long
> story short, my particular use-case involves a pipeline of creating
> an Excel spreadsheet, which is then shared and used via Google Sheets
> to allow scientists to collaborate and annotate biological data
> (using data validations), and is then exported and the annotations
> read via Apache POI.
> I have found that Apache POI can detect the validations in the
> original spreadsheet, but not in the exported version. The
> validations are read via XSSFSheet.getDataValidations(). The
> validations however, can be read fine when opening in either Excel or
> LibreOffice, and if resaved can then again be read by Apache POI.
> To clarify things, I've put together a test case to demonstrate the
> problem at https://github.com/stuzart/poi-google-export-testcase
>
> I'm not sure if this is a problem with Google export, or Apache POI.
> thanks,
> Stuart
> -------------------------------------------------------------------
> --To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Stuart Owen-3
Hi,

comments below ...

On 17/09/2020 11:48, Andreas Reichel wrote:
> Stuart,
>
> maybe you could provide the simpliest possible test case, e. g.
> creating your form with only 2 columns and 1 row, add the validation
> and then:
I found when creating a sheet as simple as you suggest, POI wasn't able
find the validations in the original either. Looking at the unpacked xml
for the sheet, the validation is defined in a weird looking block:
<ext uri="{CCE6A557-97BC-4b89-ADB6-D9C93CAAB3DF}"
xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"><x14:dataValidations
count="1"
xmlns:xm="http://schemas.microsoft.com/office/excel/2006/main"><x14:dataValidation
type="list" allowBlank="1" showInputMessage="1"
showErrorMessage="1"><x14:formula1><xm:f>Sheet2!$A$1</xm:f></x14:formula1><xm:sqref>A1</xm:sqref></x14:dataValidation></x14:dataValidations></ext>
</extLst>
(I added the file to the github test case here:
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-very-simple.xlsx)

>
> 1) extract the XML after your created the template
> 2) extract the XML after you have exporting it from Google spreadsheets
> 3) extract the XML after importing and re-exporting it from Libreoffice
>
> Then compare the XML files and see what is actually different.

So, sticking with the original files I have unpacked the xlsx and taken
a look at the xml differences. There are actually a lot of differences,
too much to do a simple diff. However manually inspecting, I've noticed
a difference in the dataValidation validation block for the sheet1.xml
that looks suspicous. In the problematic exported case, it seems to be
missing theĀ  count="1" attribute, whereas the other cases that work do
have this.
i.e
original -
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
exported -
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
resaved -
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007

I'm not familiar with the format enough to know if this is legal or not.
> Also I do not understand yet, why do you want to read the validation in
> POI again? In my limited understanding, the Google Spreadsheet will
> validate the user inputs already and assure, you will read a valid
> file?
I work on a tool that is based on embedding OWL ontologies (semantic web
stuff) into spreadsheets, allowing scientists (mostly biologists) to
annotate data in a way that is user friendly and familiar to them, and
hides the nasty stuff. We use a trick with data validation to map human
readable labels, with their sementic URI identifier, and then be able to
export this again. Using data validations allows us to do this without
any additional macros or vb scripting. The exported annotations and data
can then be fed into a database as a graph, and be queried and reasoned
across. You can find out more at https://rightfield.org.uk/

>
> Possible work-arounds in the meantime:
>
> 1) when opening the Google Spreadsheet file, read the Validations from
> the original XLSX file in parallel and merge that information
> (e.g. take the data from the Google Spreadsheet, but the structure and
> definitions from the original XLSX).
>
> 2) alternatively, run it automated through Libreoffice first (it has a
> command-line/shell only option)
Yes, option 2 is something I am thinking of as a fallback option.

many thanks,
Stuart

>
> Best regards
> Andreas
>
> On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:
>> Hi,
>> I've encountered an odd problem reading validations from an xlsx
>> spreadsheet uploaded to Google Drive, and then exported.To cut a long
>> story short, my particular use-case involves a pipeline of creating
>> an Excel spreadsheet, which is then shared and used via Google Sheets
>> to allow scientists to collaborate and annotate biological data
>> (using data validations), and is then exported and the annotations
>> read via Apache POI.
>> I have found that Apache POI can detect the validations in the
>> original spreadsheet, but not in the exported version. The
>> validations are read via XSSFSheet.getDataValidations(). The
>> validations however, can be read fine when opening in either Excel or
>> LibreOffice, and if resaved can then again be read by Apache POI.
>> To clarify things, I've put together a test case to demonstrate the
>> problem at https://github.com/stuzart/poi-google-export-testcase
>>
>> I'm not sure if this is a problem with Google export, or Apache POI.
>> thanks,
>> Stuart
>> -------------------------------------------------------------------
>> --To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Andreas Reichel

On Thu, 2020-09-17 at 16:26 +0100, Stuart Owen wrote:
> original -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
>
> exported -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
>
> resaved -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007

Stuart,

maybe you would like to open a ticket in the POI's bug tracker with
your finding.
In the meantime, you could try to add the missing count="1" attribute
by yourself? For example:


1) unzip the XLSX
2) find the xml node dataValidations, count the child elements and add
the count="..." accordingly
3) zip into the XLSX file again


If your files are small enough, than I would read the complete DOM of
the XML.
If  your files are large, then maybe GREP/SED could be helpful.


Maybe test your/our assumption first with a test case having count="3".
When editing the XML manually and adding the count="..." attribute
helps, it was definitely worth to file a ticket.


Good luck and best regards
Andreas


Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Stuart Owen-3
In reply to this post by Stuart Owen-3


On 17/09/2020 16:26, Stuart Owen wrote:

> Hi,
>
> comments below ...
>
> On 17/09/2020 11:48, Andreas Reichel wrote:
>> Stuart,
>>
>> maybe you could provide the simpliest possible test case, e. g.
>> creating your form with only 2 columns and 1 row, add the validation
>> and then:
> I found when creating a sheet as simple as you suggest, POI wasn't
> able find the validations in the original either. Looking at the
> unpacked xml for the sheet, the validation is defined in a weird
> looking block:
> <ext uri="{CCE6A557-97BC-4b89-ADB6-D9C93CAAB3DF}"
> xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"><x14:dataValidations
> count="1"
> xmlns:xm="http://schemas.microsoft.com/office/excel/2006/main"><x14:dataValidation
> type="list" allowBlank="1" showInputMessage="1"
> showErrorMessage="1"><x14:formula1><xm:f>Sheet2!$A$1</xm:f></x14:formula1><xm:sqref>A1</xm:sqref></x14:dataValidation></x14:dataValidations></ext>
>
> </extLst>
> (I added the file to the github test case here:
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-very-simple.xlsx)
>
>>
>> 1) extract the XML after your created the template
>> 2) extract the XML after you have exporting it from Google spreadsheets
>> 3) extract the XML after importing and re-exporting it from Libreoffice
>>
>> Then compare the XML files and see what is actually different.
>
> So, sticking with the original files I have unpacked the xlsx and
> taken a look at the xml differences. There are actually a lot of
> differences, too much to do a simple diff. However manually
> inspecting, I've noticed a difference in the dataValidation validation
> block for the sheet1.xml that looks suspicous. In the problematic
> exported case, it seems to be missing theĀ  count="1" attribute,
> whereas the other cases that work do have this.
> i.e
> original -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
> exported -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
> resaved -
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007
>
> I'm not familiar with the format enough to know if this is legal or not.

to add, I've tried validating the xml against its schema, and it appears
to be legal.

>
>> Also I do not understand yet, why do you want to read the validation in
>> POI again? In my limited understanding, the Google Spreadsheet will
>> validate the user inputs already and assure, you will read a valid
>> file?
> I work on a tool that is based on embedding OWL ontologies (semantic
> web stuff) into spreadsheets, allowing scientists (mostly biologists)
> to annotate data in a way that is user friendly and familiar to them,
> and hides the nasty stuff. We use a trick with data validation to map
> human readable labels, with their sementic URI identifier, and then be
> able to export this again. Using data validations allows us to do this
> without any additional macros or vb scripting. The exported
> annotations and data can then be fed into a database as a graph, and
> be queried and reasoned across. You can find out more at
> https://rightfield.org.uk/
>>
>> Possible work-arounds in the meantime:
>>
>> 1) when opening the Google Spreadsheet file, read the Validations from
>> the original XLSX file in parallel and merge that information
>> (e.g. take the data from the Google Spreadsheet, but the structure and
>> definitions from the original XLSX).
>>
>> 2) alternatively, run it automated through Libreoffice first (it has a
>> command-line/shell only option)
> Yes, option 2 is something I am thinking of as a fallback option.
>
> many thanks,
> Stuart
>>
>> Best regards
>> Andreas
>>
>> On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:
>>> Hi,
>>> I've encountered an odd problem reading validations from an xlsx
>>> spreadsheet uploaded to Google Drive, and then exported.To cut a long
>>> story short, my particular use-case involves a pipeline of creating
>>> an Excel spreadsheet, which is then shared and used via Google Sheets
>>> to allow scientists to collaborate and annotate biological data
>>> (using data validations), and is then exported and the annotations
>>> read via Apache POI.
>>> I have found that Apache POI can detect the validations in the
>>> original spreadsheet, but not in the exported version. The
>>> validations are read via XSSFSheet.getDataValidations(). The
>>> validations however, can be read fine when opening in either Excel or
>>> LibreOffice, and if resaved can then again be read by Apache POI.
>>> To clarify things, I've put together a test case to demonstrate the
>>> problem at https://github.com/stuzart/poi-google-export-testcase
>>>
>>> I'm not sure if this is a problem with Google export, or Apache POI.
>>> thanks,
>>> Stuart
>>> -------------------------------------------------------------------
>>> --To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Stuart Owen-3
In reply to this post by Andreas Reichel


On 17/09/2020 16:42, Andreas Reichel wrote:

> On Thu, 2020-09-17 at 16:26 +0100, Stuart Owen wrote:
>> original -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
>>
>> exported -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
>>
>> resaved -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007
> Stuart,
>
> maybe you would like to open a ticket in the POI's bug tracker with
> your finding.
> In the meantime, you could try to add the missing count="1" attribute
> by yourself? For example:
>
>
> 1) unzip the XLSX
> 2) find the xml node dataValidations, count the child elements and add
> the count="..." accordingly
> 3) zip into the XLSX file again
>
>
> If your files are small enough, than I would read the complete DOM of
> the XML.
> If  your files are large, then maybe GREP/SED could be helpful.
>
>
> Maybe test your/our assumption first with a test case having count="3".
> When editing the XML manually and adding the count="..." attribute
> helps, it was definitely worth to file a ticket.

Yep, OK. I will try this tomorrow and add the information to a ticket.

many thanks,
Stuart
>
>
> Good luck and best regards
> Andreas
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Andreas Reichel
In reply to this post by Stuart Owen-3
Stuart,

On Thu, 2020-09-17 at 17:07 +0100, Stuart Owen wrote:
> > I'm not familiar with the format enough to know if this is legal or not.
>
> to add, I've tried validating the xml against its schema, and it appears
>
> to be legal.


I do not doubt that. I would suspect that POI is falsely expecting that
attribute and interprets it as Zero when missing, not reading the
actual children.
Might be related to SAX vs. DOM, so they can't count the children when
that attribute is missing.

Best regards
Andreas