<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> <style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style> </head> <body dir="ltr"> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> Hello Michael,</div> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> Thanks a lot for reaching out and for providing detailed feedback on the report. All of these comments will help in further augmenting/updating our report. We are definitely interested in learning more about Calc and also sharing our ideas on DB+Spreadsheet integration. I am currently traveling and will send you a detailed response CCing the entire DataSpread team. We are also cleaning up the repo and the macros and will definitely share those ASAP. There are a number of valid concerns that have been raised and need further examination---thanks for pointing those out. <br> </div> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> <br> </div> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> We didn't even get a response from one of the other tools compared in the report when we reached out to them. So many thanks for taking our effort seriously and providing important details about Calc. 
<br> </div> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> <br> </div> <div style="font-family: "Times New Roman", Times, serif; font-size: 12pt; color: rgb(0, 0, 0);"> Regards<br> </div> <div id="Signature"> <div style="font-family:Tahoma; font-size:13px"> <div style="font-family:Tahoma; font-size:13px"> <div style="font-family:Tahoma; font-size:13px"><font size="2" face="Times New Roman">Sajjadur Rahman<br> PhD Candidate<br> CS@illinois<br> <a href="http://srahman7.web.engr.illinois.edu/" target="_blank">http://srahman7.web.engr.illinois.edu/</a></font></div> </div> </div> </div> <div id="appendonsend"></div> <hr style="display:inline-block;width:98%" tabindex="-1"> <div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Michael Meeks <michael.meeks@collabora.com><br> <b>Sent:</b> Saturday, December 7, 2019 12:05 PM<br> <b>To:</b> Rahman, Sajjadur <srahman7@illinois.edu><br> <b>Cc:</b> Kohei Yoshida <kohei@libreoffice.org>; Dennis Francis <dennis.francis@collabora.com>; Lubos Lunak <l.lunak@collabora.com>; libreoffice-dev <libreoffice@lists.freedesktop.org><br> <b>Subject:</b> benchmark of Excel, Calc, Google Docs</font> <div> </div> </div> <div class="BodyFragment"><font size="2"><span style="font-size:11pt;"> <div class="PlainText">Hi Sajjadur & team,<br> <br> Great to connect on twitter; I noticed:<br> <br> <a href="https://blog.acolyer.org/2019/12/06/benchmarking-spreadsheet-systemsand"> https://blog.acolyer.org/2019/12/06/benchmarking-spreadsheet-systemsand</a><br> your paper:<br> <a href="https://people.eecs.berkeley.edu/~adityagp/papers/spreadsheet_bench.pdf"> https://people.eecs.berkeley.edu/~adityagp/papers/spreadsheet_bench.pdf</a><br> <br> I was interested in a number of things: particularly whether we can get<br> your test sheets / macros so that we can run the tests under a profiler<br> & of course see what stupid things jump out that we 
can fix =)<br> <br> I'd also like to query: "5.2 In memory Data Layout"<br> <br> "For random data access, we randomly select a row and then get the<br> value of cell corresponding to column A within that row. We used three<br> different row ranges of the Value-only dataset: 100k, 300k, and 500k. If<br> spreadsheets use a columnar layout, the sequential access would be much<br> faster than random access due to cache locality."<br> <br> We do have a columnar layout; check out:<br> <br> <a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111">https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111</a><br> <br> // Cell values.<br> sc::CellStoreType maCells;<br> <br> is an MDDS cell store - which is fairly well optimized for the test<br> you're doing here jumping down rows. We're missing a final re-work there<br> to make it log(log(N)) lookup with an interpolating search but ...<br> assuming you have reasonably uniform, contiguous data types down a<br> column your test shouldn't have concluded:<br> <br> "Therefore, none of the systems utilize any intelligent in-memory<br> layout to speed up data access.<br> Takeaway: Spreadsheet systems do not employ a columnar<br> data layout to improve computational (e.g., aggregation) performance"<br> <br> In fact we use our columnar model to do SSE optimization of long<br> columnar sums and optimize cache locality for other formulae etc.<br> <br> In contrast my expectation (from previous Excel binary data formats) is<br> that Excel employs row-based data storage.<br> <br> Of course we build on MDDS - which provides some nice building blocks<br> for spreadsheet re-use:<br> <br> <a href="https://gitlab.com/mdds/mdds">https://gitlab.com/mdds/mdds</a><br> <br> Your general comments on missing [Global] Common Sub-Expression<br> elimination in spreadsheets seem fair, and something that should be<br> looked at as we improve our representations.<br> <br> For incremental updates 
- re-computation is frequently chosen over more<br> smarts since correctness is a far more dominating concern than<br> performance generally, and there are plenty of known performance<br> optimizations that can be done before we try to complicate things<br> further. Also the assumption that after deleting A100:<br> <br> SUM(A1:A100) - A100 === SUM(A1:A99)<br> <br> falls foul of potential precision problems. Then again so does adding<br> the numbers in a different order even at FP64 (cf. OpenCL) - and of<br> course there is lower-hanging fruit than this right now.<br> <br> The idea of converting to SQL queries is an interesting one but I find<br> it very hard to believe it would provide any performance advantage at<br> the same memory footprint. Furthermore - I'd be interested to know how<br> you do other spreadsheet operations: row & column insertion, addressing,<br> and dependency work on top of a SQL database with any efficiency.<br> <br> It is also worth bearing in mind that dependency management and<br> tracking of what to update when something changes consumes a far larger<br> proportion of a spreadsheet than is commonly expected: what to<br> re-calculate having changed A1 is far from trivial for large, twisty<br> real-world sheets.<br> <br> Furthermore another big chunk of spreadsheet authors' time is spent<br> considering and handling all of the legacy mis-features such as what to<br> do when you add an error to a bool, or a string containing a number that<br> shows up in your sheet in an unwelcome way =)<br> <br> Anyhow - would be happy to have a chat with your team at some stage if<br> you're interested in helping us to improve things here: a good start<br> would be just getting more representative benchmarks for the workload<br> you're interested in.<br> <br> And finally of course, you used a really old version. 
I'd recommend<br> using the 6.4 snapshots, or 6.3 if you must.<br> <br> <a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3">https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3</a><br> <a href="https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe">https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe</a><br> ...<br> <br> And lots of others. Indeed, we have a customer who has large ~100k+ row<br> spreadsheets with complex sorts across many rows who claims that Calc<br> sorts the data in seconds to minutes vs. Excel's hours - so I was<br> surprised to see your sort results; would be good to inspect what you do.<br> <br> I hope that helps, would love to get in touch with you & your team,<br> there is plenty of fun stuff to do here to make things quicker & better =)<br> <br> All the best,<br> <br> Michael.<br> <br> -- <br> michael.meeks@collabora.com <><, GM Collabora Productivity<br> Hangout: mejmeeks@gmail.com, Skype: mmeeks<br> (M) +44 7795 666 147 - timezone usually UK / Europe<br> </div> </span></font></div> </body> </html>