[Libreoffice] Calc - a better design ...

Wed Oct 19 23:00:26 PDT 2011

Hi Michael,

I'm hesitant to ask this because I cannot personally promise time toward 
LO (only on an as-can basis, which is dismally small ATM), but hey, you 
can easily so no.  :-)  You mention "really need[ing] a whiteboard" to 
elaborate properly.  I submit that putting your thoughts together, 
perhaps in picture form and available on the LO wiki, or put together as 
a small video to Youtube, would be extremely useful to casual LO coders 
like myself.  As an individual volunteer without a face-to-face LO team 
member against whom to bounce ideas, I'd thoroughly love an 
actual-paid-engineer's thoughts on how best to proceed on this front.

I'm personally motivated for Calc because in my science career, I really 
have to bend over backwards to make Calc work effectively for my needs 
where Excel works just plain better/faster/smaller, yet I 
philosophically have stuck myself with Free software.  To me, one of the 
biggest areas of weakness for LO, after the various random crashes 
(which are getting better!), is the memory bloat, and speed.  It's not 
features.

I'm happy to mess with Ixion (and indeed have poked at it some already), 
but I imagine the going would be much more useful with a LO core 
developer's guiding thoughts, including technical end-goals and 
migration paths.

So, would you have the time to create a screencast or video of yourself 
in front of a whiteboard?

Cheers,

Kevin

At 9:42am -0400 Wed, 27 Jul 2011, Michael Meeks wrote:
> Hi Kevin,
>
> On Wed, 2011-07-27 at 06:56 -0400, Kevin Hunter wrote:
>> At 6:07am -0400 Wed, 27 Jul 2011, Michael Meeks wrote:
>>> but IMHO -the- fundamental design problem with calc, is something
>>> quite banal - the concept that a spreadsheet is built from cells:
>>> without breaking that basic misconception I don't think we can do
>>> any of the really interesting space / time optimisations we need to
>>> do.
>>
>> Can you elaborate a little on this fundamental design flaw ?
>
> 	Hah - so, yes - and not easily - I'd really need a whiteboard.
>
>>   As a naive and unfortunately focused elsewhere personality, I don't
>> immediately know of a better model for creating a spreadsheet.
>
> 	So - of course, the first thing to say is that (at a first pass), I'd
> want the UI to continue to look the same - all that would change would
> be the underlying model.
>
>>    Is it "just" a problem of sparsity?  Or is there a much more
>> sophisticated method for memory sharing of various similar cell
>> attributes, perhaps analogous to CSS?
>
> 	Here is the thing - since we work on a per-cell basis, all of our type
> checking, and formula adaptation, and storage, and dependencies etc. are
> all ultimately per-cell based. This (crudely) means that we have an O(n)
> where n is the number of cells in vast numbers of operations where we do
> not want it. It also tends to explode storage and computation time
> around dependencies - which is a substantial cost.
>
> 	NB. much of this is quite normal for a spreadsheet FWIW, I believe
> Excel is not completely dis-similar; however I think we can do better
> with much improved (much more complex / slow) data structure design that
> will ultimately be far faster to execute.
>
> 	Take a banal example; when we do a SUM() of a million rows containing
> plain doubles, we would want to (at root) have a function that [ in the
> ideal case ] does:
>
> 	double sum (double *array, sal_Int64 nItems);
>
> 	Which we can shove onto a gpu, multi-thread if nItems is big, etc. etc.
> unfortunately approaching this limit is basically impossible in calc.
> Instead for this case we would do a very slow, pointer-chasing iteration
> over a million scattered ScCell's - we would do type checking - to
> ensure that each one was a double, and only then would we add /
> accumulate them.
>
> 	Of course none of that can be pushed to a GPU, none of it can be SIMD
> accelerated, and threading it is rather hard.
>
> 	Now ... if we stored contiguous spans of identically typed items [ ie.
> a column of doubles ], as value runs in our data structures; currently
> we (amazingly) have arrays of (row/cell*) pairs incidentally. Then we
> could push a lot of branch-heavy, expensive type checking, and lots of
> pointer chasing outside the main-loop, we'd also use rather less memory.
>
> 	That's fine for doubles of course; but the really huge wins come from
> an entirely new model of dependencies and areas containing formulae - I
> propose only storing a base formula, and an iterator to describe how
> that formula changes down a row: to compress an entire column of
> formulae to a single formula. Futhermore on top of that substantially
> discarding the existing dependency mechanism such that a single-cell
> change in a contiguous range of doubles would have a shared broadcast on
> that whole range (with the specific detail of what cell changed), such
> that that could be turned into a specific row (or row range) to
> re-compute in any dependent by comparing the base formula, plus it's
> iterator with the range that changed ;-)
>
> 	So - in a nutshell, compressing and making explicit the vast amount of
> duplication and uniformity present in all big spreadsheets.
>
> 	IMNSHO if we can get to this world, we can kick Excel's backside in the
> performance and scalability arena, storage-wise we can become as tiny as
> we should be :-) and thus ideally we can also start to address far
> larger and more interesting data-sets.
>
> 	Anyhow - as text, this is not terribly convincing; drawing on a
> whiteboard would be more so, and with lots of working code - even more
> so ;-) I keep hoping to have time to play with ixion with Kohei in this
> regard.
>
> 	ATB,
>
> 		Michael.