[REVIEW 3-5] fdo#47644 performance regression on largish .doc
Michael Meeks
michael.meeks at suse.com
Thu May 10 09:37:16 PDT 2012
Hi there,
On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote:
> fdo#47644: a big 18MB .doc is super-duper slow to load, mostly because
> of the extra checks I added to sanity-test the .doc file during parsing.
:-)
> 1st attachment gets us back to where we came from in terms of
> performance there.
Great - I've pushed that to libreoffice-3-5.
> The second one is maybe a bit more contentious for 3-5, but I include it
> for a look-over. Basically there's a "page chain" in the ole2 format and
> to seek to a location you walk through the chain until you get to the
> correct page.
Yep - it's a FAT structure; nasty one-slot linked list.
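(A toy sketch of that walk, for anyone following along - the names here,
WalkChain / rFat / STG_EOF_SENTINEL, are made up for illustration and are
not the real sot classes: each seek re-follows the chain one page at a
time, so the cost is linear in how far you seek.)

    #include <cstdint>
    #include <vector>

    // The FAT maps "page" -> "next page in the chain": a one-slot linked list.
    constexpr std::int32_t STG_EOF_SENTINEL = -2; // stand-in for the real EOF mark

    // Follow the chain nRel hops onwards from nPage; O(nRel) work per seek.
    std::int32_t WalkChain(const std::vector<std::int32_t>& rFat,
                           std::int32_t nPage, std::int32_t nRel)
    {
        while (nRel > 0 && nPage != STG_EOF_SENTINEL)
        {
            nPage = rFat[nPage]; // one hop along the list
            --nRel;
        }
        return nPage;
    }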
> I see that in practice most documents have their chain in
> a presorted order (I have no idea if having them unsorted is actually a
> sign of a broken document or if it's legal)
It is legal, but it's highly unlikely.
> So the second attachment keeps the chain if it's one of the circumstances
> where we have to parse the whole thing anyway. If not (the usual
> case), then if we're seeking a (fairly arbitrary) large number of steps,
> where it's likely that the cost-benefit is in favour of it, we generate
> the chain once and hold on to it for reuse until there is any event which
> might invalidate the chain.
So - I'm a bit lost with the second patch. The FAT is essentially a
linked list; and we walk it to find the block we want, so:
nRel = 3; -> go three steps along the linked list.
A -> B -> C == C ...
Your pageCache looks like the right approach, and turns this into a
simple vector we can rapidly look up:
[0] = A, [1] = B, [2] = C
And so on - hopefully I'm with you so far :-) With the 512-byte block size
we expect, that turns a 100MB OLE2 stream into an 800k vector (which is
just fine).
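Put another way (a sketch of the idea only, not the actual patch -
BuildPageChainCache / NthPage are names I've invented here):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::int32_t STG_EOF_SENTINEL = -2; // stand-in for the real EOF mark

    // Walk the whole chain once and remember every page in order.
    // 100MB / 512-byte pages ~= 200,000 entries of 4 bytes each ~= 800k.
    std::vector<std::int32_t> BuildPageChainCache(
        const std::vector<std::int32_t>& rFat, std::int32_t nFirstPage)
    {
        std::vector<std::int32_t> aCache;
        for (std::int32_t nPage = nFirstPage; nPage != STG_EOF_SENTINEL;
             nPage = rFat[nPage])
            aCache.push_back(nPage);
        return aCache;
    }

    // With the cache built, "nRel pages along from the start" is a plain
    // index lookup instead of a pointer chase.
    std::int32_t NthPage(const std::vector<std::int32_t>& rCache, std::size_t nRel)
    {
        return nRel < rCache.size() ? rCache[nRel] : STG_EOF_SENTINEL;
    }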
> Second patch knocks about 50 seconds off load time. First patch knocks
> some unknown but > 10 mins off load time.
My question is thus about:
[snip]
std::vector<sal_Int32>::iterator aI;
if (m_bSortedPageChain)
    aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
else
    aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
if (aI == m_aPagesCache.end())
{
    SAL_WARN("sot", "Unknown page position");
    nBgn = STG_EOF;
}
else
{
    size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);
    size_t nIndex = nBgnDistance + nRel;
    if (nIndex > m_aPagesCache.size())
    {
        nRel = m_aPagesCache.size() - nBgnDistance;
        nIndex = m_aPagesCache.size() - 1;
    }
    else
        nRel = 0;
    nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
    nBgn = m_aPagesCache[nIndex];
}
[/snip]
Surely that should read:
    nBgn = m_aPagesCache[nRel];
or did I go a bit crazy? [ I'm trying to turn the above into that
mentally and giving up ;-].
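To make my confusion concrete, here is a toy, self-contained example -
made-up page numbers, nothing from the real code - of what the two
readings return when nBgn is not the first page of the chain:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        // Toy chain, purely illustrative: pages 10 -> 11 -> 12 -> 13.
        std::vector<std::int32_t> aPagesCache{10, 11, 12, 13};

        std::int32_t nBgn = 11; // pretend we currently sit on the 2nd page
        std::size_t nRel = 2;   // and want to advance two more pages

        auto aI = std::find(aPagesCache.begin(), aPagesCache.end(), nBgn);
        std::size_t nBgnDistance = std::distance(aPagesCache.begin(), aI);

        std::cout << "patch indexing: " << aPagesCache[nBgnDistance + nRel] // 13
                  << "\nplain [nRel]:   " << aPagesCache[nRel]              // 12
                  << '\n';
    }

The two only coincide when nBgn is the first page of the chain, so perhaps
it hinges on what nBgn holds on entry.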
Otherwise the clearing of the cache looks sensible; though
StgFATStrm::SetPage() should prolly just update that vector as well as
doing the nasty FAT walk, so:
    m_aPagesCache[nBlocks] = nNewPage;
or somesuch (assuming it is long enough for that).
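i.e. something like this (just a sketch of the idea, not checked against
the real StgFATStrm::SetPage - the helper name is invented):

    #include <cstddef>
    #include <vector>
    #include <sal/types.h>

    // Sketch: keep the cached chain in step instead of throwing it away.
    // m_aPagesCache / nBlocks / nNewPage follow the names in the patch.
    void UpdatePageCacheOnSetPage(std::vector<sal_Int32>& rPagesCache,
                                  std::size_t nBlocks, sal_Int32 nNewPage)
    {
        if (nBlocks < rPagesCache.size())
            rPagesCache[nBlocks] = nNewPage; // cheap in-place fix-up
        else
            rPagesCache.clear();             // too short: invalidate, rebuild later
    }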
Otherwise it looks really lovely.
Thanks !
Michael.
--
michael.meeks at suse.com <><, Pseudo Engineer, itinerant idiot