Fw: benchmark of Excel, Calc, Google Docs

Mon Dec 9 19:14:46 UTC 2019

Oops, for some reason I forgot to add Sajjadur to the email list!

Aditya

On Sun, Dec 8, 2019 at 3:14 PM Aditya Parameswaran <adityagp at berkeley.edu>
wrote:

> Michael,
>
> Great to connect with you.  Sajjadur forwarded your email to me.
>
> Just a bit of context about the work that you're referring to --
>
> I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along
> with Prof. Karrie Karahalios at the University of Illinois and many Ph.D.
> student researchers, we've been working on developing a scalable
> spreadsheet system, DataSpread (http://dataspread.github.io), for about
> half a decade now. We've published a number of papers on this topic as well
> as developed a prototype system.  Sajjadur led the effort on benchmarking
> that you referenced in your email and a navigation plugin (more on that in
> a bit), while Mangesh (also cc-ed, when he was at the University of
> Illinois) led the efforts to integrate spreadsheets and databases and to
> build a more interactive (asynchronous) formula execution engine when he
> was at Illinois.
>
> We'd be very keen to collaborate to see if some of the ideas that we've
> developed and opportunities we've identified would make sense in Calc.  Our
> ultimate aim is to percolate some of these ideas back into popular
> spreadsheet systems like Calc, so I'm excited to have this opportunity.
>
> Answers to some of your questions in-line below:
>
>>
>>         I was interested in a number of things: particularly whether we
>> can get
>> your test sheets / macros so that we can run the tests under a profiler
>> & of course see what stupid things jump out that we can fix =)
>>
>
> Yes, of course. Sajjadur, with Kelly's help, is looking into packaging
> this and sending it your way.
>
>>
>>         We do have a columnular layout; checkout:
>>
>>
>> https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111
>>
>> ... assuming you have reasonably uniform, contiguous data types down a
>> column your test shouldn't have concluded:
>>
>>         "Takeaway: Spreadsheet systems do not employ a columnar
>> data layout to improve computational (e.g., aggregation) performance"
>>
>
> Looking at Figure 10 in our preprint,  Calc does seem to be somewhat
> faster for sequential read than random read at least on larger sizes.
>
> So I am not sure why we concluded outright that none of the spreadsheet
> systems employ a columnar layout -- this is a good catch; we will fix.
>
> That said, looking at Figure 10, it is surprising that the gains for the
> sequential read are not a lot more;  and the gains should increase
> proportionally.  So something funky is going on. Worth investigating.
>
>
>
>>         For incremental updates - re-computation is frequently chosen
>> over more
>> smarts  since correctness is a far more dominating concern than
>> performance generally, and there are plenty of know performance
>> optimizations that can be done before we try to complicate things
>> further. Also the assumption that after deleting A100:
>>
>>         SUM(A1:A100) - A100 === SUM(A1:A99)
>>
>>         falls foul of potential precision problems.
>>
>
> Makes sense.
>
>
>>
>>         The idea of converting to SQL queries is an interesting one but I
>> find
>> it very hard to believe it would provide any performance advantage at
>> the same memory footprint. Furthermore - I'd be interested to know how
>> you do other spreadsheet operations: row & column insertion, addressing,
>> and dependency work on top of a SQL database with any efficiency.
>>
>
> We started by having the relational database be a simple persistent
> storage layer, when coupled with an index to retrieve data by position, can
> allow us to scroll through large datasets of billions of rows at ease. We
> developed a new positional index to handle insertions and deletions in
> O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree that pushing
> the computation to the relational database does have overheads; but at the
> same time, it allows for scaling to arbitrarily large datasets.
>
> Happy to explain this on the phone.
>
>
>>         It is also worth bearing in mind that dependency management and
>> tracking of what to update when something changes consumes a far larger
>> proportion of a spreadsheet than is commonly expected: what to
>> re-calculate having changed A1 is far from trivial for large, twisty
>> real-world sheets.
>>
>
> Ah, for this, we have a new asynchronous formula computation capability
> (led by Mangesh) that looks something like this:
> https://twitter.com/adityagp/status/1146105708024229888?s=20 (gif in
> tweet).  When a change is made, we use a compact dependency maintenance
> structure to identify dependent cells that are hidden away with a progress
> bar, and then compute them in the background.
>
>>
>>         Anyhow - would be happy to have a chat with your team at some
>> stage if
>> you're interested in helping us to improving things here: a good start
>> would be just getting more representative benchmarks for the workload
>> you're interested in would be really useful.
>>
>
> Would love to chat and see if any of the work that we're doing can
> translate into Calc, and how we can contribute.
>
> One other project that may be of interest is one where we're trying to
> build a spreadsheet summarization and navigation tool, which can be
> especially helpful on very large spreadsheets.
> http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf
>
>
>>
>>         And finally of course, you used a really old version. I'd
>> recommend
>> using the 6.4 snapshots, or 6.3 if you must.
>>
>>
>> https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3
>>
>> https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe
>> ...
>>
>
> Agreed. We started the benchmarking effort a couple years ago, and the old
> version was the new version back then :-)
>
>
>
>>         And lots of others. Indeed, we have a customer who has large
>> ~100k+ row
>> spreadsheets with complex sorts across many rows who claims that Calc
>> sorts the data in second to minutes vs. Excels' hours - so I was
>> surprised to see your sort results; would be good to inspect what you do.
>>
>         I hope that helps, would love to get in touch with you & your team,
>> there is plenty of fun stuff to do here to make things quicker & better =)
>>
>
> Again, happy to share what we know!  Let's find a time to chat.  I see
> that you're in Europe, so mornings for us (PT/CT) may work better?
> Sajjadur is traveling, so I'm not entirely sure if he's around, but I
> should be able to find time to chat early in the morning any day next week.
>
> Aditya
>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/libreoffice/attachments/20191209/1d395308/attachment.html>