Support for the Apache Parquet file format

Shadi Akiki shadi at autofitcloud.com
Fri Nov 22 13:27:02 UTC 2019


 > the benefits that this format brings

Parquet is a column-based format, so a query can read just the columns 
it needs without loading the full file.

More can be found at [1]
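
To make the "without loading the full file" point concrete, here is a 
toy sketch of the idea. It is NOT the Parquet format: the layout and the 
names write_columnar and read_column are made up for illustration. It 
only borrows the general principle of storing each column contiguously 
and keeping an index at the end of the file, so a reader can seek 
straight to one column.

```python
import io
import json
import struct


def write_columnar(buf, table):
    """Write each column as its own contiguous block, then a footer index.

    Toy layout (an assumption for this sketch, not Parquet's real one):
    [column blocks...][JSON footer: name -> (offset, length)][4-byte footer size]
    """
    index = {}
    for name, values in table.items():
        block = json.dumps(values).encode("utf-8")
        index[name] = (buf.tell(), len(block))  # where this column lives
        buf.write(block)
    footer = json.dumps(index).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # last 4 bytes: footer size


def read_column(buf, name):
    """Read a single column by seeking to it.

    Returns (values, bytes_read) so the caller can compare the bytes
    touched against the total file size.
    """
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-(4 + footer_len), io.SEEK_END)
    index = json.loads(buf.read(footer_len))
    offset, length = index[name]
    buf.seek(offset)
    values = json.loads(buf.read(length))
    return values, 4 + footer_len + length


# Small in-memory demo with made-up data.
table = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [9.5, 3.0, 7.25],
}
buf = io.BytesIO()
write_columnar(buf, table)
prices, bytes_read = read_column(buf, "price")
total_bytes = len(buf.getvalue())
print(prices, bytes_read, "of", total_bytes, "bytes read")
```

With a row-based format like CSV, getting the "price" column means 
scanning every row; here the "id" and "name" blocks are never read.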

I see 2 distinct points to weigh:

1. Convenience: sometimes I build a programmatic process that spits out 
a bunch of Parquet files, then I query them with AWS Athena or Apache 
Drill. If I want to peek into a Parquet file, I have to either write a 
pandas script or open it with VisiData. If I find a problem with the 
file, I need to go back to the process that generated it, modify it, 
and re-generate the file (or write a script specifically for editing 
the file). If I could just double-click, edit, and save, it would be so 
much easier. That's a major advantage of CSV, despite its inefficiency 
in query/filtering performance.

2. Performance: On the other hand, a spreadsheet editor might not be 
designed to exploit this column-based format for better efficiency. It's 
expected to open the whole file anyway. Maybe filtering a worksheet 
loaded from Parquet would be faster than one loaded from CSV, but that 
depends on how the filtering is implemented. I have no idea how it's 
done in Calc or other editors. We all know the dread of opening a large 
file in a spreadsheet editor. But then again, maybe that's when the data 
should be moved into a database rather than stay in a heavy spreadsheet.


 > https://github.com/apache/parquet-format

 > The best place to learn about the specifics of this file format

Yes, that's it. I don't want to sound self-contradictory, but maybe it's 
NOT a good idea to support Parquet. I was just bringing it up, and maybe 
this needs more thought about how useful it would be and whether people 
would actually use it. Chicken-and-egg problem?


Links:

[1] https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

Shadi Akiki
Founder & CEO, AutofitCloud
https://autofitcloud.com/
+1 813 579 4935

On 11/22/19 2:42 PM, Kohei Yoshida wrote:
> On 22.11.2019 02:37, Shadi Akiki wrote:
>
>> I'm wondering why Parquet is not yet a supported format in LibreOffice
>> Calc (and most desktop worksheet processing tools for that matter).
>
> Well, one reason may be that nobody had asked for it yet!  On that 
> note, asking about it and raising awareness (which you did) is a 
> necessary first step.
>
> Also, it would be nice to know the benefits that this format brings 
> that any other existing formats currently do not.  I use pandas 
> occasionally and I do work with people who use it on a regular basis, 
> but I had not heard this file format mentioned in our conversations to 
> this day.
>
> Is this page
>
> https://github.com/apache/parquet-format
>
> The best place to learn about the specifics of this file format, or is 
> there any other page that provides more details?
>
> One way we can add support for a new file format such as this one to 
> Calc is to add it to the orcus library [1], which Calc uses internally 
> to handle a subset of file formats.  That may potentially be a much 
> easier route than adding it to the LibreOffice code base directly... 
> Full disclosure: I do maintain this library.
>
> Kohei
>
> [1] https://gitlab.com/orcus/orcus
>

