Fw: benchmark of Excel, Calc, Google Docs

Tue Dec 10 20:29:57 UTC 2019

Hi Aditya,

On 08/12/2019 23:14, Aditya Parameswaran wrote:
> I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along
> with Prof. Karrie Karahalios at the University of Illinois and many
> Ph.D. student researchers, we've been working on developing a scalable
> spreadsheet system, DataSpread (http://dataspread.github.io), for about
> half a decade now.

	Interesting stuff.

> We'd be very keen to collaborate to see if some of the ideas that we've
> developed and opportunities we've identified would make sense in Calc.

	Sounds good. Very busy this week, but would you be up for a conference
(with whomever is interested) sometime in the evening (UK time) of the
17th or 19th? We could use e.g. https://meet.jit.si/CalcChat.

> Our ultimate aim is to percolate some of these ideas back into popular
> spreadsheet systems like Calc, so I'm excited to have this opportunity.

	Great. Some good ideas to include there, only a chunk of typing is
required =)

> Yes, of course. Sajjadur, with Kelly's help, is looking into packaging
> this and sending it your way.

	Excellent; thanks.

> So I am not sure why we concluded outright that none of the spreadsheet
> systems employ a columnar layout -- this is a good catch; we will fix.

	=)

> That said, looking at Figure 10, it is surprising that the gains for the
> sequential read are not a lot more;  and the gains should increase
> proportionally.  So something funky is going on. Worth investigating. 

	Ah - well ... so ;-) as I said it depends on your data-set, and its
type homogeneity down the column to a degree, and also we can improve
our lookup algorithm there.

> We started by having the relational database be a simple persistent
> storage layer, when coupled with an index to retrieve data by position,
> can allow us to scroll through large datasets of billions of rows at
> ease. We developed a new positional index to handle insertions and
> deletions in O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree
> that pushing the computation to the relational database does have
> overheads; but at the same time, it allows for scaling to arbitrarily
> large datasets. 
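
[ For readers following the thread: a minimal sketch of what "retrieve
data by position, with insertions and deletions in O(log(n))" can look
like. This is not the index from the linked paper; an implicit treap
(expected O(log n) per operation) is swapped in as a stand-in, and the
names PositionalIndex, insert, delete and get are hypothetical. The
stored values stand for row identifiers in the backing table. ]

```python
import random


class _Node:
    __slots__ = ("value", "prio", "size", "left", "right")

    def __init__(self, value, prio):
        self.value = value   # e.g. a row id in the backing table (assumed)
        self.prio = prio     # random heap priority keeps the tree balanced
        self.size = 1        # number of rows in this subtree
        self.left = None
        self.right = None


def _size(node):
    return node.size if node else 0


def _pull(node):
    node.size = 1 + _size(node.left) + _size(node.right)
    return node


def _split(node, k):
    """Split a subtree into (first k rows, remaining rows)."""
    if node is None:
        return None, None
    if _size(node.left) >= k:
        first, rest = _split(node.left, k)
        node.left = rest
        return first, _pull(node)
    first, rest = _split(node.right, k - _size(node.left) - 1)
    node.right = first
    return _pull(node), rest


def _merge(a, b):
    """Concatenate two subtrees; every row of a precedes every row of b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return _pull(a)
    b.left = _merge(a, b.left)
    return _pull(b)


class PositionalIndex:
    """Hypothetical positional index: rows are addressed by 0-based position."""

    def __init__(self, seed=0):
        self._root = None
        self._rng = random.Random(seed)

    def __len__(self):
        return _size(self._root)

    def insert(self, pos, value):
        """Insert value so it ends up at position pos; later rows shift down."""
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        first, rest = _split(self._root, pos)
        node = _Node(value, self._rng.random())
        self._root = _merge(_merge(first, node), rest)

    def delete(self, pos):
        """Remove and return the row at position pos; later rows shift up."""
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        first, rest = _split(self._root, pos)
        target, rest = _split(rest, 1)
        self._root = _merge(first, rest)
        return target.value

    def get(self, pos):
        """Return the row at position pos by walking subtree sizes."""
        node = self._root
        while node is not None:
            left = _size(node.left)
            if pos < left:
                node = node.left
            elif pos == left:
                return node.value
            else:
                pos -= left + 1
                node = node.right
        raise IndexError(pos)
```

[ The point of the design is that no row stores its own position:
positions are derived from subtree sizes, so inserting a row in the
middle of the sheet never renumbers the rows below it. ]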

	Ooh - nice paper. Your crawled data-set looks quite interesting too: we
run wide-scale crash-testing on the LibreOffice code-base across ~100k
files, and enlarging our corpus there, or better, getting some
statistical view of which OOXML attributes (and thus features) are most
used out there, would be extremely useful to us as we develop the core.

	I like the data on spreadsheet and formula shape - that is very useful.
Do you have data on the geometry of formulae - as in rows vs. columns ?
[ we switched to columnar storage based mostly on experience rather than
hard data ;-].

	It is also interesting to have access to very large (1.3m row)
data-sets that can have useful analysis done on them - would love to see
the source data there.

> Would love to chat and see if any of the work that we're doing can
> translate into Calc, and how we can contribute. 

	Great.

> One other project that may be of interest is one where we're trying to
> build a spreadsheet summarization and navigation tool, which can be
> especially helpful on very large
> spreadsheets.  http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf

	Sounds good too. Of course, most useful on the huge corpus of existing
sheets out there in XLS[X] / ODS format.

> Agreed. We started the benchmarking effort a couple years ago, and the
> old version was the new version back then :-) 

	Heh ;-)

> Again, happy to share what we know!  Let's find a time to chat.  I see
> that you're in Europe, so mornings for us (PT/CT) may work better? 
> Sajjadur is traveling, so I'm not entirely sure if he's around, but I
> should be able to find time to chat early in the morning any day next week. 

	Sounds good, cf. above - if we can't make that - early in the new year
would be great.

	I look forward to talking,

		Michael.

-- 
michael.meeks at collabora.com <><, GM Collabora Productivity
Hangout: mejmeeks at gmail.com, Skype: mmeeks
(M) +44 7795 666 147 - timezone usually UK / Europe