<div dir="ltr"><div dir="ltr">Oops, for some reason I forgot to add Sajjadur to the email list!<div><br></div><div>Aditya</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Dec 8, 2019 at 3:14 PM Aditya Parameswaran <<a href="mailto:adityagp@berkeley.edu">adityagp@berkeley.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Michael,<div><br></div><div>Great to connect with you. Sajjadur forwarded your email to me. </div><div><br></div><div>Just a bit of context about the work that you're referring to -- </div><div><br></div><div>I'm Aditya Parameswaran, an assistant professor at UC Berkeley. Along with Prof. Karrie Karahalios at the University of Illinois and many Ph.D. student researchers, we've been working on developing a scalable spreadsheet system, DataSpread (<a href="http://dataspread.github.io" target="_blank">http://dataspread.github.io</a>), for about half a decade now. We've published a number of papers on this topic as well as developed a prototype system. Sajjadur led the effort on benchmarking that you referenced in your email and a navigation plugin (more on that in a bit), while Mangesh (also cc-ed, when he was at the University of Illinois) led the efforts to integrate spreadsheets and databases and to build a more interactive (asynchronous) formula execution engine when he was at Illinois. </div><div><br></div><div>We'd be very keen to collaborate to see if some of the ideas that we've developed and opportunities we've identified would make sense in Calc. Our ultimate aim is to percolate some of these ideas back into popular spreadsheet systems like Calc, so I'm excited to have this opportunity.</div><div><br></div><div>Answers to some of your questions in-line below:</div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div><br>
I was interested in a number of things: particularly whether we can get<br>
your test sheets / macros so that we can run the tests under a profiler<br>
& of course see what stupid things jump out that we can fix =)<br></div></span></font></div></div></div></blockquote><div><br></div><div>Yes, of course. Sajjadur, with Kelly's help, is looking into packaging this and sending it your way.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div><br>
We do have a columnular layout; checkout:<br>
<br>
<a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111" target="_blank">https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111</a><br><br>... assuming you have reasonably uniform, contiguous data types down a<br>
column your test shouldn't have concluded:<br>
<br>
"Takeaway: Spreadsheet systems do not employ a columnar<br>
data layout to improve computational (e.g., aggregation) performance"<br></div></span></font></div></div></div></blockquote><div><br></div><div>Looking at Figure 10 in our preprint, Calc does seem to be somewhat faster for sequential read than random read at least on larger sizes. </div><div><br></div><div>So I am not sure why we concluded outright that none of the spreadsheet systems employ a columnar layout -- this is a good catch; we will fix.</div><div><br></div><div>That said, looking at Figure 10, it is surprising that the gains for the sequential read are not a lot more; and the gains should increase proportionally. So something funky is going on. Worth investigating. </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
For incremental updates - re-computation is frequently chosen over more<br>
smarts since correctness is a far more dominating concern than<br>
performance generally, and there are plenty of know performance<br>
optimizations that can be done before we try to complicate things<br>
further. Also the assumption that after deleting A100:<br>
<br>
SUM(A1:A100) - A100 === SUM(A1:A99)<br>
<br>
falls foul of potential precision problems. <br></div></span></font></div></div></div></blockquote><div><br></div><div>Makes sense. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
<br>
The idea of converting to SQL queries is an interesting one but I find<br>
it very hard to believe it would provide any performance advantage at<br>
the same memory footprint. Furthermore - I'd be interested to know how<br>
you do other spreadsheet operations: row & column insertion, addressing,<br>
and dependency work on top of a SQL database with any efficiency.<br></div></span></font></div></div></div></blockquote><div><br></div><div>We started by having the relational database be a simple persistent storage layer, when coupled with an index to retrieve data by position, can allow us to scroll through large datasets of billions of rows at ease. We developed a new positional index to handle insertions and deletions in O(log(n)) -- <a href="https://arxiv.org/pdf/1708.06712.pdf" target="_blank">https://arxiv.org/pdf/1708.06712.pdf</a>. I agree that pushing the computation to the relational database does have overheads; but at the same time, it allows for scaling to arbitrarily large datasets. </div><div><br></div><div>Happy to explain this on the phone. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
It is also worth bearing in mind that dependency management and<br>
tracking of what to update when something changes consumes a far larger<br>
proportion of a spreadsheet than is commonly expected: what to<br>
re-calculate having changed A1 is far from trivial for large, twisty<br>
real-world sheets.<br></div></span></font></div></div></div></blockquote><div><br></div><div>Ah, for this, we have a new asynchronous formula computation capability (led by Mangesh) that looks something like this: <a href="https://twitter.com/adityagp/status/1146105708024229888?s=20" target="_blank">https://twitter.com/adityagp/status/1146105708024229888?s=20</a> (gif in tweet). When a change is made, we use a compact dependency maintenance structure to identify dependent cells that are hidden away with a progress bar, and then compute them in the background. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
<br>
Anyhow - would be happy to have a chat with your team at some stage if<br>
you're interested in helping us to improving things here: a good start<br>
would be just getting more representative benchmarks for the workload<br>
you're interested in would be really useful.<br></div></span></font></div></div></div></blockquote><div><br></div><div>Would love to chat and see if any of the work that we're doing can translate into Calc, and how we can contribute. </div><div><br></div><div>One other project that may be of interest is one where we're trying to build a spreadsheet summarization and navigation tool, which can be especially helpful on very large spreadsheets. <a href="http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf" target="_blank">http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf</a></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
<br>
And finally of course, you used a really old version. I'd recommend<br>
using the 6.4 snapshots, or 6.3 if you must.<br>
<br>
<a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3" target="_blank">https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3</a><br>
<a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe" target="_blank">https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe</a><br>
...<br></div></span></font></div></div></div></blockquote><div> </div><div>Agreed. We started the benchmarking effort a couple years ago, and the old version was the new version back then :-) </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
And lots of others. Indeed, we have a customer who has large ~100k+ row<br>
spreadsheets with complex sorts across many rows who claims that Calc<br>
sorts the data in second to minutes vs. Excels' hours - so I was<br>
surprised to see your sort results; would be good to inspect what you do.</div></span></font></div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div>
I hope that helps, would love to get in touch with you & your team,<br>
there is plenty of fun stuff to do here to make things quicker & better =)<br></div></span></font></div></div></div></blockquote><div><br></div><div><div>Again, happy to share what we know! Let's find a time to chat. I see that you're in Europe, so mornings for us (PT/CT) may work better? Sajjadur is traveling, so I'm not entirely sure if he's around, but I should be able to find time to chat early in the morning any day next week. </div><div><br></div><div>Aditya</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-6227616310275270593m_-8937938351560396000m_-7277728804925398929gmail-m_3911941826362033687m_-5801946590732629049m_-375889121891063932m_-5223167876847649153gmail-m_202700241290530657Signature"><div><font size="2"><span style="font-size:11pt"><div></div></span></font></div></div></div></blockquote></div><div> </div></div></div>
</blockquote></div></div>