[REVIEW 3-5] fdo#47644 performance regression on largish .doc

Michael Meeks michael.meeks at suse.com
Thu May 10 09:37:16 PDT 2012


Hi there,

On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote:
> fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because
> of the extra checks I added to sanity-test the .doc file during parsing.

	:-)

> 1st attachment gets us back to where we came from in terms of
> performance there.

	Great - I've pushed that to libreoffice-3-5.

> The second one is maybe a bit more contentious for 3-5, but I include it
> for a look-over. Basically there's a "page chain" in the ole2 format and
> to seek to a location you walk through the chain until you get to the
> correct page.

	Yep - it's a FAT structure; nasty one-slot linked list.
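
	For anyone skimming: the seek we do today is, in effect, a walk of
that linked list - something like the sketch below (made-up names, not the
real StgFAT code; sal_Int32 / STG_EOF as in the sot module):

	// follow the chain nRel hops from page nBgn; each hop is a FAT
	// lookup (and possibly a page read), so a seek is O(chain length).
	sal_Int32 WalkChain( sal_Int32 nBgn, sal_Int32 nRel )
	{
	    sal_Int32 nPage = nBgn;
	    while( nRel-- > 0 && nPage != STG_EOF )
	        nPage = GetNextPage( nPage );  // one hop along the list
	    return nPage;
	}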

>  I see that in practice most documents have their chain in
> a presorted order (I have no idea if having them unsorted is actually a
> sign of a broken document or if it's legal)

	It is legal, but it's highly unlikely.

> So the second attachment keeps the chain if it's one of the circumstances
> where we have to parse the whole thing anyway; if not (the usual case),
> and we're seeking a (fairly arbitrary) large number of steps where it's
> likely that the cost-benefit is in favour of it, then generate the chain
> once and hold on to it for reuse until there's any event which might
> invalidate the chain.

	So - I'm a bit lost with the second patch. The FAT is essentially a
linked list; and we walk it to find the block we want, so:

	nRel = 3; -> go three steps along the linked list.
	A -> B -> C == C ...

	Your pageCache looks like the right approach, and turns this into a
simple vector we can rapidly look up:

	[0] = A, [1] = B, [2] = C

	And so on - hopefully I'm with you so far :-) With the 512-byte block
size we expect, that turns a 100MB OLE2 stream into an ~800KB vector (which
is just fine).
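
	i.e. once the chain is flattened, a relative seek is just an index
into that vector - roughly (again a sketch fragment, names made up; the
size arithmetic is in the comment):

	// 100MB / 512-byte pages ~= 204,800 pages; 204,800 entries
	// * sizeof(sal_Int32) (4 bytes) ~= 800KB of cache - cheap next to
	// re-walking the FAT on every seek.
	std::vector<sal_Int32> aPagesCache; // aPagesCache[i] == i'th page of the chain
	sal_Int32 PageAt( size_t nSteps )
	{
	    return nSteps < aPagesCache.size() ? aPagesCache[nSteps] : STG_EOF;
	}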

> Second patch knocks about 50 seconds off load time. First patch knocks
> some unknown but > 10 mins off load time. 

	My question is thus about:

[snip]
        std::vector<sal_Int32>::iterator aI;

        if (m_bSortedPageChain)
            aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
        else
            aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);

        if (aI == m_aPagesCache.end())
        {
            SAL_WARN("sot", "Unknown page position");
            nBgn = STG_EOF;
        }
        else
        {
            size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);

            size_t nIndex = nBgnDistance + nRel;

            if (nIndex > m_aPagesCache.size())
            {
                nRel = m_aPagesCache.size() - nBgnDistance;
                nIndex = m_aPagesCache.size() - 1;
            }
            else
                nRel = 0;

            nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
            nBgn = m_aPagesCache[nIndex];
        }
[/snip]

	Surely that should read:

		nBgn = m_aPagesCache[nRel];

	or did I go a bit crazy ? [ I'm trying to turn the above into that
mentally and giving up ;-].

	Otherwise the clearing of the cache looks sensible; though
StgFATStrm::SetPage() should probably just update that vector, as well as
doing the nasty FAT walk, so:

	m_aPagesCache[nBlocks] = nNewPage;

	or somesuch (assuming it is long enough for that).
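
	i.e. something along these lines (untested sketch, assuming the
nBlocks / nNewPage parameters named above and the patch's m_aPagesCache
member):

	// in StgFATStrm::SetPage(), after the existing FAT update: keep the
	// cached chain in sync instead of invalidating it wholesale.
	if( static_cast<size_t>(nBlocks) < m_aPagesCache.size() )
	    m_aPagesCache[nBlocks] = nNewPage;
	else
	    m_aPagesCache.clear(); // cache doesn't reach that far - drop & rebuild lazily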

	Otherwise it looks really lovely.

	Thanks !

		Michael.

-- 
michael.meeks at suse.com  <><, Pseudo Engineer, itinerant idiot


