benchmark of Excel, Calc, Google Docs

Sat Dec 7 18:44:23 UTC 2019

Hello Michael,
Thanks a lot for reaching out and for providing detailed feedback on the report. All of these comments will help in further augmenting/updating our report. We are definitely interested in learning more about Calc and also sharing our ideas on DB+Spreadsheet integration. I am currently traveling and will send you a detailed response CCing the entire DataSpread team. We are also cleaning up the repo and the macros and will definitely share those ASAP. There are a number of valid concerns that have been raised and need further examination---thanks for pointing those out.

We didn't even get a response from one of the other tools compared in the report when we reached out to them. So many thanks for taking our effort seriously and providing important details about Calc.

Regards
Sajjadur Rahman
PhD Candidate
CS at illinois
http://srahman7.web.engr.illinois.edu/
________________________________
From: Michael Meeks <michael.meeks at collabora.com>
Sent: Saturday, December 7, 2019 12:05 PM
To: Rahman, Sajjadur <srahman7 at illinois.edu>
Cc: Kohei Yoshida <kohei at libreoffice.org>; Dennis Francis <dennis.francis at collabora.com>; Lubos Lunak <l.lunak at collabora.com>; libreoffice-dev <libreoffice at lists.freedesktop.org>
Subject: benchmark of Excel, Calc, Google Docs

Hi Sajjadur & team,

        Great to connect on twitter; I noticed:

        https://blog.acolyer.org/2019/12/06/benchmarking-spreadsheet-systemsand
your paper:
        https://people.eecs.berkeley.edu/~adityagp/papers/spreadsheet_bench.pdf

        I was interested in a number of things: particularly whether we can get
your test sheets / macros so that we can run the tests under a profiler
& of course see what stupid things jump out that we can fix =)

        I'd also like to query: "5.2 In memory Data Layout"

        "For random data access, we randomly select a row and then get the
value of cell corresponding to column A within that row. We used three
different row ranges of the Value-only dataset: 100k, 300k, and 500k. If
spreadsheets use a columnar layout, the sequential access would be much
faster than random access due to cache locality."

        We do have a columnular layout; checkout:

https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111

    // Cell values.
    sc::CellStoreType maCells;

        is an MDDS cell store - which is fairly well optimized for the test
you're doing here jumping down rows. We're missing a final re-work there
to make it log(log(N)) lookup with an interpolating search but ...
assuming you have reasonably uniform, contiguous data types down a
column your test shouldn't have concluded:

        "Therefore, none of the systems utilize any intelligent in-memory
layout to speed up data access.
Takeaway: Spreadsheet systems do not employ a columnar
data layout to improve computational (e.g., aggregation) performance"

        I fact we use our columnar model to do SSE optimization of long
columnar sums and optimize cache locality for other formulae etc.

        In contrast my expectation (from previous Excel binary data formats) is
that Excel employs a row-based data storage.

        Of course we build on MDDS - which provides some nice building blocks
for speadsheet re-use:

        https://gitlab.com/mdds/mdds

        Your general comments on missing [Global] Common Sub-Expression
elimination in spreadsheets seem fair, and something that should be
looked at as we improve our representations.

        For incremental updates - re-computation is frequently chosen over more
smarts  since correctness is a far more dominating concern than
performance generally, and there are plenty of know performance
optimizations that can be done before we try to complicate things
further. Also the assumption that after deleting A100:

        SUM(A1:A100) - A100 === SUM(A1:A99)

        falls foul of potential precision problems. Then again so does adding
the numbers in a different order even at FP64 (cf. OpenCL) - and of
course there are lower hanging fruit than this right now.

        The idea of converting to SQL queries is an interesting one but I find
it very hard to believe it would provide any performance advantage at
the same memory footprint. Furthermore - I'd be interested to know how
you do other spreadsheet operations: row & column insertion, addressing,
and dependency work on top of a SQL database with any efficiency.

        It is also worth bearing in mind that dependency management and
tracking of what to update when something changes consumes a far larger
proportion of a spreadsheet than is commonly expected: what to
re-calculate having changed A1 is far from trivial for large, twisty
real-world sheets.

        Furthermore another big chunk of spreadsheet authors time is spent
considering and handling all of the legacy mis-features such as what to
do when you add an error to a bool, or a string containing a number that
shows up in your sheet in an unwelcome way =)

        Anyhow - would be happy to have a chat with your team at some stage if
you're interested in helping us to improving things here: a good start
would be just getting more representative benchmarks for the workload
you're interested in would be really useful.

        And finally of course, you used a really old version. I'd recommend
using the 6.4 snapshots, or 6.3 if you must.

https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3
https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe
...

        And lots of others. Indeed, we have a customer who has large ~100k+ row
spreadsheets with complex sorts across many rows who claims that Calc
sorts the data in second to minutes vs. Excels' hours - so I was
surprised to see your sort results; would be good to inspect what you do.

        I hope that helps, would love to get in touch with you & your team,
there is plenty of fun stuff to do here to make things quicker & better =)

        All the best,

                Michael.

--
michael.meeks at collabora.com <><, GM Collabora Productivity
Hangout: mejmeeks at gmail.com, Skype: mmeeks
(M) +44 7795 666 147 - timezone usually UK / Europe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/libreoffice/attachments/20191207/0e5fe44c/attachment.html>