Support for the Apache Parquet file format
Shadi Akiki
shadi at autofitcloud.com
Fri Nov 22 13:27:02 UTC 2019
> the benefits that this format brings
Its column-based format means that its data can be queried without
loading the full file.
More can be found at [1].
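To make the "without loading the full file" point concrete, here is a
toy sketch in Python. It is NOT the real Parquet encoding: the layout,
the JSON blobs, and the function names write_columnar / read_column are
all made up for illustration. The only idea borrowed from Parquet is
that each column is stored contiguously and a footer records where each
column lives, so a reader can seek straight to the one column it needs.

```python
import json
import os
import struct
import tempfile


def write_columnar(path, columns):
    """Write each column as one contiguous blob, followed by a footer index.

    Toy layout (an assumption for illustration, not Parquet's format):
    [column blobs...][footer JSON][4-byte little-endian footer length]
    """
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            blob = json.dumps(values).encode("utf-8")
            index[name] = (f.tell(), len(blob))  # (offset, length) of this column
            f.write(blob)
        footer = json.dumps(index).encode("utf-8")
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))


def read_column(path, name):
    """Read a single column by seeking to it; other columns are never read."""
    with open(path, "rb") as f:
        # The last 4 bytes hold the footer length.
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        # Read the footer to find the offset and length of each column.
        f.seek(-(4 + footer_len), os.SEEK_END)
        index = json.loads(f.read(footer_len))
        offset, length = index[name]
        # Jump directly to the requested column's bytes.
        f.seek(offset)
        return json.loads(f.read(length))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.toycol")
        write_columnar(path, {"id": [1, 2, 3], "city": ["Tampa", "Beirut", "Oslo"]})
        print(read_column(path, "city"))
```

With a row-based format like CSV, getting the "city" values means
parsing every line of the file. Here the reader touches only the footer
and the bytes of that one column.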
I see 2 distinct advantages:
1. Convenience: sometimes I build a programmatic process that spits out
a bunch of parquet files, then I query them with AWS Athena or Apache
Drill. If I want to peek into the parquet file, I have to either
write up a pandas script or open it with visidata. If I find a
problem with the file, I need to go back to the process that generated
it, modify it, and re-generate the file (or write a script specifically
for editing the file). If I could just double-click, edit, save, it
would be so much easier. That's a major advantage of CSV despite its
inefficient query/filtering performance.
2. Performance: On the other hand, a spreadsheet editor might not be
designed to exploit this column-based format for better efficiency. It's
expected to open the whole file anyway. Maybe filtering the worksheet
with parquet would be faster than with CSV, but that depends on how the
filtering is implemented. I have no idea how it's done in Calc or other
editors. We all know the dread of opening a large file in a spreadsheet
editor. But then again, maybe that's when the data should be moved into
a database rather than stay in a heavy spreadsheet.
> https://github.com/apache/parquet-format
> The best place to learn about the specifics of this file format
Yes, that's it. I don't want to sound self-contradictory, but maybe it's
NOT a good idea to support Parquet. I was just bringing it up, and maybe
this needs some more thought about the degree of usefulness or if people
will actually use it. Chicken-and-egg problem?
Links:
[1] https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
Shadi Akiki
Founder & CEO, AutofitCloud
https://autofitcloud.com/
+1 813 579 4935
On 11/22/19 2:42 PM, Kohei Yoshida wrote:
> On 22.11.2019 02:37, Shadi Akiki wrote:
>
>> I'm wondering why Parquet is not yet a supported format in LibreOffice
>> Calc (and most desktop worksheet processing tools for that matter).
>
> Well, one reason may be that nobody had asked for it yet! On that
> note, asking about it and raising awareness (which you did) is a
> necessary first step.
>
> Also, it would be nice to know the benefits that this format brings
> that any other existing formats currently do not. I use pandas
> occasionally and I do work with people who use it on a regular basis,
> but I had not heard this file format mentioned in our conversations to
> this day.
>
> Is this page
>
> https://github.com/apache/parquet-format
>
> The best place to learn about the specifics of this file format, or is
> there any other page that provides more details?
>
> One way we can add support for a new file format such as this one to
> Calc is to add it to the orcus library [1], which Calc uses internally
> to handle a subset of file formats. That may potentially be a much
> easier route than adding it to the LibreOffice code base directly...
> Full disclosure: I do maintain this library.
>
> Kohei
>
> [1] https://gitlab.com/orcus/orcus
>