switching to XFastParser

Thu Mar 31 12:11:52 UTC 2016

Hi

[Including the original off-list discussion below for context for anyone who cares]

So I took a look at Daniel Sikeler's branch at
    https://cgit.freedesktop.org/libreoffice/core/log/?h=feature/fastparser
and it looks like he did a pretty thorough job of converting everything to XFastParser.

What was the reason this did not get merged?

Would it suffice to simply pull the commits out of this tree one-by-one, dust them off, pretty them up, verify them 
through 'make check' and push them to master?

Regards, Noel

On 2016/02/29 12:36 PM, Michael Meeks wrote:
> Hi Noel,
>
> 	This belongs CC'd to the dev. list; please do fwd it there to continue
> the discussion =)
>
> On Sun, 2016-02-28 at 09:05 +0200, Noel Grandin wrote:
>> When you guys did the SAX parsing improvements (XFastParser2), why did
>> we maintain the UNO API?
>
> 	Is there an XFastParser2 API ?
>
>> Why not use libxml/expat directly ?
>
> 	The libxml2 API (the faster parser) is horrendous - the XFastParser API
> is at least a tokenized API - which is essentially what we want the code
> to consume; ultimately we want to patch libxml2 some more as well to
> improve load performance - removing some of the more stupid pieces;
> quite possibly we also want to implement an even faster compressed XML
> parsing scheme I have up my sleeve behind that API.
>
> 	We did short-circuit UNO for the tokenization piece - which saved a
> huge chunk of time, and profiled it rather intensively. Last I looked, I
> saw no significant performance cost from the UNO interface.
>
> 	Finally - the libxml2 and expat APIs are (like most SAX APIs)
> synchronous, and same-thread; a big part of our load-time speed win
> comes from doing the XML parse + tokenize in another thread, and
> emitting the events in the main thread [ cf. slide decks at several
> LibreOffice conferences on the topic ].
>
> 	ie. nothing to 'fix' there =)
>
>> I'm assuming there is something I'm missing?
>
> 	Depends what you're trying to achieve =) if you want to improve
> performance and cleanliness -by-far- the most useful thing remaining to
> be done there is to switch the ODF filters in xmloff/ to use the
> FastParser API - currently they do tokenization themselves in a horribly
> inefficient way; and of course they don't take advantage of the threaded
> parsing etc.
>
> 	There was a Munich student (Daniel Sikeler) working on that -
> unfortunately with very little time for mentoring; so it may be a
> challenge to try to rescue that work. xmloff/ is quite big - and is built
> on by code outside it, in the main components too. So - almost certainly
> by far the best way here is an incremental one.
>
> 	We need to write a good, clean XFastParser <-> XParser mapping; that will
> probably require some love in sax/, since some of the semantics don't map
> perfectly in corner cases. I believe Daniel's branch is
> feature/fastparser - and you could rescue just this mapper from there I
> think.
>
> 	That would then allow the threaded processing & tokenization (we would
> need to de-tokenize again to the XParser interface but I think we would
> still get some nice wins ;-). When that works nicely - we need to
> connect the xmloff/ tokenization code to the XFastParser tokenized
> results to avoid doing all of that twice, and slowly and carefully push
> the interface change across the code to kill the XParser variant.
>
> 	At least - that would be my suggestion of something worthwhile & juicy
> to dig teeth into =) it is
>
> 	ATB,
>
> 		Michael.
>
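
To make the suggested mapping step concrete, here is a rough sketch of the
"tokenize, then de-tokenize again for the old interface" idea. It is plain
Python for illustration only, not code from sax/ or xmloff/: the token table,
the class names and the method names are all hypothetical, loosely modelled on
the fast/unknown element split of a tokenized handler. It does not cover
attributes, namespaces or the threading part.

```python
from typing import Dict, List, Tuple

# Hypothetical token table, for illustration only.
TOKENS: Dict[str, int] = {"text:p": 1, "text:span": 2}
NAMES: Dict[int, str] = {tok: name for name, tok in TOKENS.items()}
UNKNOWN_TOKEN = -1

Event = Tuple[str, str]  # ("start" | "end", qualified element name)


def tokenize(name: str) -> int:
    """Map an element name to its token, or UNKNOWN_TOKEN if not in the table."""
    return TOKENS.get(name, UNKNOWN_TOKEN)


class LegacyHandler:
    """Stand-in for an old-style consumer that only understands string names."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def start_element(self, name: str) -> None:
        self.events.append(("start", name))

    def end_element(self, name: str) -> None:
        self.events.append(("end", name))


class DetokenizingAdapter:
    """Accepts tokenized events and replays them as string events."""

    def __init__(self, legacy: LegacyHandler) -> None:
        self.legacy = legacy

    def start_fast_element(self, token: int) -> None:
        self.legacy.start_element(NAMES[token])

    def end_fast_element(self, token: int) -> None:
        self.legacy.end_element(NAMES[token])

    def start_unknown_element(self, name: str) -> None:
        # Names without a token pass through unchanged.
        self.legacy.start_element(name)

    def end_unknown_element(self, name: str) -> None:
        self.legacy.end_element(name)


def feed(adapter: DetokenizingAdapter, events: List[Event]) -> None:
    """Play the role of the tokenizing parser: tokenize once, then dispatch."""
    for kind, name in events:
        token = tokenize(name)
        if token == UNKNOWN_TOKEN:
            if kind == "start":
                adapter.start_unknown_element(name)
            else:
                adapter.end_unknown_element(name)
        elif kind == "start":
            adapter.start_fast_element(token)
        else:
            adapter.end_fast_element(token)


if __name__ == "__main__":
    legacy = LegacyHandler()
    feed(DetokenizingAdapter(legacy), [("start", "text:p"), ("end", "text:p")])
    print(legacy.events)
```

In this sketch the legacy consumer sees the same string events it always did,
so the adapter can sit in front of unconverted code while individual consumers
are switched over to read the tokens directly.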