tdf#109097 - mdds accelerating random lookups

Sat Aug 26 13:39:55 UTC 2017

On Thu, 2017-08-24 at 15:33 +0530, Dennis Francis wrote:
> Hello,
> 
> As this bug report (https://bugs.documentfoundation.org/show_bug.cgi?
> id=109097) suggests
> it is highly desirable to improve the lookup time in
> multi_type_vector (currently in worst case 
> it takes O(N)) without position hints. One way to improve the
> situation is to store absolute 
> position of first element of each block in the block instead of the
> block size member. This 
> can improve the lookup time up to log(log(N)) using linear
> interpolation search or at least log(N) 
> if we fallback to binary search after a few initial passes of linear-
> interpolation search.

Agreed.  That's certainly one way to improve the situation, and I'm
pretty sure that's the direction you guys are going (or hoping to go,
with our blessing).

But I have to point out that, the bug you are referencing may not be
the best one to reference to push your agenda, since that performance
issue described in the bug report can be solved by switching to using
position hints, and modifying the existing code that accesses and
updates the matrix content.  That's precisely what Markus was
suggesting and what we've been doing traditionally prior to this
conversation.

I'd rather use the argument that, the change you are proposing does
make sense from the point of view of making mdds::multi_type_vector
more friendly to concurrent read-only access from multiple threads.  In
that light, use of position hints may not be desirable since it would
incur additional costs due to the position hints being additional
states that we would need to keep and share between multiple threads.
That would require thread locking semantics, which you guys are
understandably trying to avoid.

> 
> Down side of this approach is that in the worst case, insert/deletes
> incur O(N) cost because
> absolute positions stored in each block subsequent to the
> delete/insert block have to be changed,

This downside is also correct.  We should also note that in this
scenario, N refers to the number of blocks, not the number of cells,
the former of which can be substantially lower in typical real-life
scenarios than the latter.  One good news is that, this position update
can potentially can concurrently if necessary since such operation only
modifies the size of one block at any given moment, and the delta size
can be applied to all the subsequent blocks independently.

I'm not suggesting we should make it multi-threaded from the get-go
(I'd rather you not do that, at least not without some benchmarking),
but when the worst comes to worst we could go that route.

> Before working on any change from my side, I would love to hear
> everyone's thoughts on the topic first. Thanks !

So, my short answer is "it makes sense".  There are some risks involved
especially with regard to the potential cost on insertions and
deletions which we just don't know ahead of time.  But my gut feeling
tells me that it'll probably be "fine for most common use cases".

But I have to warn you; making this change would not be a walk in the
park. :-)  multi_type_vector has tons of code branches to deal with all
sorts of situations, and you need to be prepared to tackle all of those
branches.  I hope you are mentally prepared for it. :-)

That's all I gotta say.  Please keep me in the loop.

Kohei