Fw: benchmark of Excel, Calc, Google Docs

Sun Dec 8 23:14:01 UTC 2019

Michael,

Great to connect with you.  Sajjadur forwarded your email to me.

Just a bit of context about the work that you're referring to --

I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along with
Prof. Karrie Karahalios at the University of Illinois and many Ph.D.
student researchers, we've been working on developing a scalable
spreadsheet system, DataSpread (http://dataspread.github.io), for about
half a decade now. We've published a number of papers on this topic as well
as developed a prototype system.  Sajjadur led the effort on benchmarking
that you referenced in your email and a navigation plugin (more on that in
a bit), while Mangesh (also cc-ed, when he was at the University of
Illinois) led the efforts to integrate spreadsheets and databases and to
build a more interactive (asynchronous) formula execution engine when he
was at Illinois.

We'd be very keen to collaborate to see if some of the ideas that we've
developed and opportunities we've identified would make sense in Calc.  Our
ultimate aim is to percolate some of these ideas back into popular
spreadsheet systems like Calc, so I'm excited to have this opportunity.

Answers to some of your questions in-line below:

>
>         I was interested in a number of things: particularly whether we
> can get
> your test sheets / macros so that we can run the tests under a profiler
> & of course see what stupid things jump out that we can fix =)
>

Yes, of course. Sajjadur, with Kelly's help, is looking into packaging this
and sending it your way.

>
>         We do have a columnular layout; checkout:
>
>
> https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111
>
> ... assuming you have reasonably uniform, contiguous data types down a
> column your test shouldn't have concluded:
>
>         "Takeaway: Spreadsheet systems do not employ a columnar
> data layout to improve computational (e.g., aggregation) performance"
>

Looking at Figure 10 in our preprint,  Calc does seem to be somewhat faster
for sequential read than random read at least on larger sizes.

So I am not sure why we concluded outright that none of the spreadsheet
systems employ a columnar layout -- this is a good catch; we will fix.

That said, looking at Figure 10, it is surprising that the gains for the
sequential read are not a lot more;  and the gains should increase
proportionally.  So something funky is going on. Worth investigating.

>         For incremental updates - re-computation is frequently chosen over
> more
> smarts  since correctness is a far more dominating concern than
> performance generally, and there are plenty of know performance
> optimizations that can be done before we try to complicate things
> further. Also the assumption that after deleting A100:
>
>         SUM(A1:A100) - A100 === SUM(A1:A99)
>
>         falls foul of potential precision problems.
>

Makes sense.

>
>         The idea of converting to SQL queries is an interesting one but I
> find
> it very hard to believe it would provide any performance advantage at
> the same memory footprint. Furthermore - I'd be interested to know how
> you do other spreadsheet operations: row & column insertion, addressing,
> and dependency work on top of a SQL database with any efficiency.
>

We started by having the relational database be a simple persistent storage
layer, when coupled with an index to retrieve data by position, can allow
us to scroll through large datasets of billions of rows at ease. We
developed a new positional index to handle insertions and deletions in
O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree that pushing the
computation to the relational database does have overheads; but at the same
time, it allows for scaling to arbitrarily large datasets.

Happy to explain this on the phone.

>         It is also worth bearing in mind that dependency management and
> tracking of what to update when something changes consumes a far larger
> proportion of a spreadsheet than is commonly expected: what to
> re-calculate having changed A1 is far from trivial for large, twisty
> real-world sheets.
>

Ah, for this, we have a new asynchronous formula computation capability
(led by Mangesh) that looks something like this:
https://twitter.com/adityagp/status/1146105708024229888?s=20 (gif in
tweet).  When a change is made, we use a compact dependency maintenance
structure to identify dependent cells that are hidden away with a progress
bar, and then compute them in the background.

>
>         Anyhow - would be happy to have a chat with your team at some
> stage if
> you're interested in helping us to improving things here: a good start
> would be just getting more representative benchmarks for the workload
> you're interested in would be really useful.
>

Would love to chat and see if any of the work that we're doing can
translate into Calc, and how we can contribute.

One other project that may be of interest is one where we're trying to
build a spreadsheet summarization and navigation tool, which can be
especially helpful on very large spreadsheets.
http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf

>
>         And finally of course, you used a really old version. I'd recommend
> using the 6.4 snapshots, or 6.3 if you must.
>
>
> https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3
>
> https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe
> ...
>

Agreed. We started the benchmarking effort a couple years ago, and the old
version was the new version back then :-)

>         And lots of others. Indeed, we have a customer who has large
> ~100k+ row
> spreadsheets with complex sorts across many rows who claims that Calc
> sorts the data in second to minutes vs. Excels' hours - so I was
> surprised to see your sort results; would be good to inspect what you do.
>
        I hope that helps, would love to get in touch with you & your team,
> there is plenty of fun stuff to do here to make things quicker & better =)
>

Again, happy to share what we know!  Let's find a time to chat.  I see that
you're in Europe, so mornings for us (PT/CT) may work better?  Sajjadur is
traveling, so I'm not entirely sure if he's around, but I should be able to
find time to chat early in the morning any day next week.

Aditya

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/libreoffice/attachments/20191208/02dfc4de/attachment.html>