switching to XFastParser
Noel Grandin
noelgrandin at gmail.com
Thu Mar 31 12:11:52 UTC 2016
Hi
[Including the original off-list discussion below for context for anyone who cares]
So I took a look at Daniel Sikeler's branch at
https://cgit.freedesktop.org/libreoffice/core/log/?h=feature/fastparser
and it looks like he did a pretty thorough job of converting everything to XFastParser.
What was the reason this did not get merged?
Would it suffice to simply pull the commits out of this tree one-by-one, dust them off, pretty them up, verify them
through 'make check' and push them to master?
Regards, Noel
On 2016/02/29 12:36 PM, Michael Meeks wrote:
> Hi Noel,
>
> This belongs CC'd to the dev. list; please do fwd it there to continue
> the discussion =)
>
> On Sun, 2016-02-28 at 09:05 +0200, Noel Grandin wrote:
>> When you guys did the SAX parsing improvements (XFastParser2), why did
>> we maintain the UNO API?
>
> Is there an XFastParser2 API ?
>
>> Why not use libxml/expat directly ?
>
> The libxml2 API (the faster parser) is horrendous - the XFastParser API
> is at least a tokenized API - which is essentially what we want the code
> to consume; ultimately we want to patch libxml2 some more as well to
> improve load performance - removing some of the more stupid pieces;
> quite possibly we also want to implement an even faster compressed XML
> parsing scheme I have up my sleeve behind that API.
>
> We did short-circuit UNO for the tokenization piece - which saved a
> huge chunk of time, and profiled it rather intensively. Last I looked, I
> saw no significant performance cost from the UNO interface.
>
> Finally - the libxml2 and expat APIs are (like most SAX APIs)
> synchronous, and same-thread; a big part of our load-time speed win
> comes from doing the XML parse + tokenize in another thread, and
> emitting the events in the main thread [ cf. slide decks at several
> LibreOffice conferences on the topic ].
>
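An illustrative sketch of the threading idea described above, not taken from the LibreOffice code: a worker thread parses and tokenizes, and the main thread consumes the resulting events. It uses Python's standard xml.sax module purely for demonstration; the token table, the event tuples, and the names TokenizingHandler, parse_in_worker and load are all assumptions made for this example.

```python
import queue
import threading
import xml.sax

# Hypothetical token table; a real filter would use a generated one.
TOKENS = {"doc": 1, "p": 2}
UNKNOWN = -1
_DONE = object()  # sentinel marking the end of the event stream


class TokenizingHandler(xml.sax.ContentHandler):
    """Runs in the worker thread: turns SAX callbacks into queued token events."""

    def __init__(self, events):
        super().__init__()
        self.events = events

    def startElement(self, name, attrs):
        self.events.put(("start", TOKENS.get(name, UNKNOWN)))

    def endElement(self, name):
        self.events.put(("end", TOKENS.get(name, UNKNOWN)))


def parse_in_worker(xml_bytes, events):
    """Parse and tokenize; always post the sentinel so the consumer can stop."""
    try:
        xml.sax.parseString(xml_bytes, TokenizingHandler(events))
    finally:
        events.put(_DONE)


def load(xml_bytes):
    """Start the parser thread, then drain its events on the calling thread."""
    events = queue.Queue(maxsize=1024)  # bounded, so the parser cannot run far ahead
    worker = threading.Thread(target=parse_in_worker, args=(xml_bytes, events))
    worker.start()
    consumed = []
    while True:
        event = events.get()
        if event is _DONE:
            break
        consumed.append(event)  # a real importer would build the document here
    worker.join()
    return consumed
```

The bounded queue is one possible design choice: it keeps memory use flat while still letting parsing overlap with the work done on the consuming side.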
> ie. nothing to 'fix' there =)
>
>> I'm assuming there is something I'm missing?
>
> Depends what you're trying to achieve =) if you want to improve
> performance and cleanliness -by-far- the most useful thing remaining to
> be done there is to switch the ODF filters in xmloff/ to use the
> FastParser API - currently they do tokenization themselves in a horribly
> inefficient way; and of course they don't take advantage of the threaded
> parsing etc.
>
> There was a Munich student (Daniel Sikeler) working on that -
> unfortunately with very little time for mentoring; so it may be a
> challenge to try to rescue that work. xmloff/ is quite big - and is built
> on from outside, in the main components, too. So - almost certainly by far the
> best way here is an incremental one.
>
> We need to write a good, clean XFastParser <-> XParser mapping; prolly
> that will require some love in sax/, as some of the semantics don't map
> entirely perfectly in corner cases. I believe Daniel's branch is
> feature/fastparser - and you could rescue just this mapper from there I
> think.
>
> That would then allow the threaded processing & tokenization (we would
> need to de-tokenize again to the XParser interface but I think we would
> still get some nice wins ;-). When that works nicely - we need to
> connect the xmloff/ tokenization code to the XFastParser tokenized
> results to avoid doing all of that twice, and slowly and carefully push
> the interface change across the code to kill the XParser variant.
>
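A minimal sketch of the mapping step described above, assuming a simple two-way token table: tokenized events are turned back into names and forwarded to an existing string-based consumer. None of these class or method names come from sax/ or from the feature/fastparser branch; they are hypothetical and only loosely mirror the idea of a fast (tokenized) handler feeding a legacy one.

```python
# Hypothetical token table; the real one would be generated.
NAME_TO_TOKEN = {"office:document": 1, "text:p": 2}
TOKEN_TO_NAME = {token: name for name, token in NAME_TO_TOKEN.items()}


class LegacyHandler:
    """Stand-in for an existing consumer that expects element names as strings."""

    def __init__(self):
        self.seen = []

    def start_element(self, name):
        self.seen.append("<" + name + ">")

    def end_element(self, name):
        self.seen.append("</" + name + ">")


class DetokenizingAdapter:
    """Accepts tokenized events and forwards them to a legacy handler by name."""

    def __init__(self, legacy):
        self.legacy = legacy

    def start_fast_element(self, token):
        # Known token: map it back to the qualified name.
        self.legacy.start_element(TOKEN_TO_NAME[token])

    def end_fast_element(self, token):
        self.legacy.end_element(TOKEN_TO_NAME[token])

    def start_unknown_element(self, name):
        # Elements without a token already carry their name; pass it through.
        self.legacy.start_element(name)

    def end_unknown_element(self, name):
        self.legacy.end_element(name)


legacy = LegacyHandler()
adapter = DetokenizingAdapter(legacy)
adapter.start_fast_element(1)
adapter.start_unknown_element("ext:custom")
adapter.end_unknown_element("ext:custom")
adapter.end_fast_element(1)
```

Once consumers read tokens directly, an adapter like this becomes unnecessary, which matches the incremental route suggested above.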
> At least - that would be my suggestion of something worthwhile & juicy
> to dig teeth into =) it is
>
> ATB,
>
> Michael.
>