[REVIEW 3-5] fdo#47644 performance regression on largish .doc

Caolán McNamara caolanm at redhat.com
Thu May 10 08:06:18 PDT 2012


fdo#47644: a big 18 meg .doc is super slow to load, mostly because of
the extra checks I added to sanity-test the .doc file during parsing.
It turns out that seeking backwards in those ole2 formats is incredibly
slow, which might affect other formats like xls and ppt that use some of
the shared "msfilter" code.

The 1st attachment gets us back to where we were before in terms of
performance there.

The second one is maybe a bit more contentious for 3-5, but I include it
for a look-over. Basically there's a "page chain" in the ole2 format, and
to seek to a location you walk through the chain until you get to the
correct page. I see that in practice most documents have their chain in
presorted order (I have no idea whether having them unsorted is actually
a sign of a broken document or whether it's legal), so we can do a far
faster binary_search if we walk the chain once and hold on to the
results.
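
A sketch of that idea, again with made-up names rather than the actual
patch (it reuses the Fat struct from the sketch above): walk the chain
once into a vector, after which position-to-page is a plain index
lookup, and a page-to-position query can use std::lower_bound when the
chain happens to be presorted.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct PageChainCache
    {
        std::vector<int32_t> chain; // chain[i] == page holding the i'th slice
        bool sorted = false;

        void build(const Fat& fat, int32_t firstPage)
        {
            chain.clear();
            for (int32_t page = firstPage; page >= 0; page = fat.next(page))
                chain.push_back(page);
            sorted = std::is_sorted(chain.begin(), chain.end());
        }

        // Position -> page: one vector index instead of a chain walk.
        int32_t pageAt(int64_t offset, int32_t pageSize) const
        {
            size_t idx = static_cast<size_t>(offset / pageSize);
            return idx < chain.size() ? chain[idx] : -2; // -2 == end of chain
        }

        // Page -> chain position: binary search if the chain is presorted,
        // linear scan if it came unsorted.
        std::ptrdiff_t indexOf(int32_t page) const
        {
            if (sorted)
            {
                auto it = std::lower_bound(chain.begin(), chain.end(), page);
                return (it != chain.end() && *it == page) ? it - chain.begin()
                                                          : -1;
            }
            auto it = std::find(chain.begin(), chain.end(), page);
            return it != chain.end() ? it - chain.begin() : -1;
        }
    };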

So the second attachment keeps the chain if it's one of the
circumstances where we have to parse the whole thing anyway. If not (the
usual case), then if we're seeking a (fairly arbitrary) large number of
steps, where it's likely that the cost-benefit is in favour of it, it
generates the chain once and holds on to it for reuse until any event
occurs which might invalidate the chain.
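
Continuing the same made-up sketch, the heuristic looks something like
the following: only pay for building the cache when a seek would walk
more steps than some (fairly arbitrary) threshold, and drop the cache on
anything that might relink pages.

    #include <cstdint>

    constexpr int64_t nCacheThreshold = 256; // fairly arbitrary, as above

    struct CachedStream
    {
        Fat fat;
        int32_t firstPage = -2;
        int32_t pageSize = 512;
        PageChainCache cache;
        bool haveCache = false;

        int32_t pos2Page(int64_t offset)
        {
            if (!haveCache && offset / pageSize > nCacheThreshold)
            {
                cache.build(fat, firstPage); // walk the chain once...
                haveCache = true;            // ...and keep it for reuse
            }
            if (haveCache)
                return cache.pageAt(offset, pageSize);
            // short seeks keep the old cheap walk
            return seekToPage(fat, firstPage, offset, pageSize);
        }

        // Anything that can relink pages (growing, truncating, compacting
        // the storage) must throw the cached chain away.
        void invalidateCache() { haveCache = false; }
    };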

The second patch knocks about 50 seconds off the load time. The first
patch knocks some unknown amount, but more than 10 minutes, off the load
time.

C.

Don't even think about measuring on a gcc dbgutil build btw: the concept
checking of the STL lower_bound is a huge cost, so the dbgutil times
aren't even roughly uniformly slower than a product build, but spike
dramatically on some STL operations.

