libebook filter sniffing cost ...

Thu Dec 12 02:02:57 PST 2013

Hi David & Fridrich,

	Just doing some load time profiling, and I notice that the libebook
filter chews just under 3% of the load-time of (quite a large) XLSX
file ;-)

	It seems the filter / sniffing / detection code there is particularly
problematic. I wonder if we need something like this:

	git log -u -1 53138c9968e28a25a8cd6d2b5e3d31cbb3257852

	to avoid thrashing the XStream read function? We do 52k 'read' calls
on the XStream, which is really not a fast interface to use for small
reads.
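
	To make the idea concrete, here is a minimal sketch of the sort of
read buffering I have in mind. It is not the code from the commit above;
SlowStream and BufferedReader are hypothetical names, and SlowStream is
only a stand-in for the real stream that counts how often it is hit:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a slow stream interface; it only serves
// bytes from memory and counts how often read() is called.
struct SlowStream
{
    std::vector<unsigned char> data;
    std::size_t pos = 0;
    std::size_t calls = 0;

    std::size_t read(unsigned char* out, std::size_t n)
    {
        ++calls;
        const std::size_t avail = std::min(n, data.size() - pos);
        if (avail)
            std::memcpy(out, data.data() + pos, avail);
        pos += avail;
        return avail;
    }
};

// Serves small reads from a block buffer, refilling it with one
// large read on the underlying stream whenever it runs dry.
class BufferedReader
{
public:
    explicit BufferedReader(SlowStream& s, std::size_t blockSize = 4096)
        : m_stream(s), m_buf(blockSize)
    {
        assert(blockSize > 0);
    }

    // Returns the number of bytes copied; fewer than n means end of stream.
    std::size_t read(unsigned char* out, std::size_t n)
    {
        std::size_t done = 0;
        while (done < n)
        {
            if (m_pos == m_len)
            {
                m_len = m_stream.read(m_buf.data(), m_buf.size());
                m_pos = 0;
                if (m_len == 0)
                    break; // end of stream
            }
            const std::size_t chunk = std::min(n - done, m_len - m_pos);
            std::memcpy(out + done, m_buf.data() + m_pos, chunk);
            m_pos += chunk;
            done += chunk;
        }
        return done;
    }

private:
    SlowStream& m_stream;
    std::vector<unsigned char> m_buf;
    std::size_t m_pos = 0;
    std::size_t m_len = 0;
};
```

	With a 4096 byte block, a long run of two-byte reads turns into one
underlying call per block instead of one per read; the block size is
just a guess and would want measuring.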

	http://people.freedesktop.org/~michael/sheet-profile.txt

	Has the profile there; compare EBookImportFilter::detect to
framework::LoadEnv::startLoading.

	For thumbnailing we had a similar problem with reading strings, which
was improved but not fixed by:

commit d67cd21033877c9c09d9cc4f14c2c4658e973f57
Author: Mathieu Parent <mathieu.parent at nantesmetropole.fr>
Date:   Mon Oct 14 22:23:05 2013 +0100

    fdo#56007 - Read more bytes on Zip read (for thumbnails)

	Particularly on remote file-systems we'd do many remote calls here -
which is really not ideal.

	I've pushed a small patch to avoid some of the sillier calls to
realloc:

template< class E >
inline void Sequence< E >::realloc( sal_Int32 nSize )
{
    const Type & rType = ::cppu::getTypeFavourUnsigned( this );
    sal_Bool success =
    ::uno_type_sequence_realloc(
        &_pSequence, rType.getTypeLibType(), nSize,
        (uno_AcquireFunc)cpp_acquire, (uno_ReleaseFunc)cpp_release );
    if (!success)
        throw ::std::bad_alloc();
}

	Calling that unconditionally, even when the sequence is already the
right length, seems particularly silly ;-) [ I assume that the
WPXSvInputStream, by keeping the sequence around, should save that
allocation & be quite efficient through a blizzard of identically sized
reads anyhow ;-].

	It makes me wonder whether the above should have a fast-path for
pointless reallocs to the same size though.
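
	Roughly what I mean, as a stand-alone sketch rather than a patch to
the real template: ByteSeq below is a hypothetical, much simplified
stand-in for Sequence< sal_Int8 > (no sharing, no UNO type machinery),
and reallocCount() exists only so the effect can be observed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical, much simplified stand-in for a byte sequence.
class ByteSeq
{
public:
    ByteSeq() = default;
    ByteSeq(const ByteSeq&) = delete;
    ByteSeq& operator=(const ByteSeq&) = delete;
    ~ByteSeq() { std::free(m_data); }

    void realloc(std::size_t nSize)
    {
        // Fast path: same size requested, so there is nothing to do.
        if (nSize == m_size)
            return;

        if (nSize == 0)
        {
            std::free(m_data);
            m_data = nullptr;
        }
        else
        {
            void* p = std::realloc(m_data, nSize);
            if (!p)
                throw std::bad_alloc();
            m_data = static_cast<unsigned char*>(p);
        }
        m_size = nSize;
        ++m_reallocs;
        assert(m_data != nullptr || m_size == 0);
    }

    std::size_t size() const { return m_size; }

    // Number of times the slow path actually ran.
    std::size_t reallocCount() const { return m_reallocs; }

private:
    unsigned char* m_data = nullptr;
    std::size_t m_size = 0;
    std::size_t m_reallocs = 0;
};
```

	A thousand realloc( 8 ) calls in a row then cost one allocation and
999 integer compares. Whether the same early return is safe in the real
Sequence depends on its sharing semantics, which this sketch ignores.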

	Thoughts appreciated though; is there some ordering of sniffing such
that we can prioritize common formats over less common ones? And has
libebook perhaps got into that stack too high up?
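
	To illustrate the ordering question, a toy sketch only: Detector,
detectFormat and the priority numbers are all made up here and say
nothing about how the real type detection is wired up. The point is
just that cheap, common detectors run first and we stop at the first
match:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical detector record; lower priority values run earlier.
struct Detector
{
    std::string name;
    int priority;
    std::function<bool(const std::string&)> matches;
};

// Runs detectors in priority order and stops at the first match.
// If 'tried' is given, it receives the names of the detectors that
// were actually run. Returns an empty string when nothing matches.
std::string detectFormat(std::vector<Detector> detectors,
                         const std::string& header,
                         std::vector<std::string>* tried = nullptr)
{
    std::stable_sort(detectors.begin(), detectors.end(),
                     [](const Detector& a, const Detector& b)
                     { return a.priority < b.priority; });

    for (const Detector& d : detectors)
    {
        assert(d.matches);
        if (tried)
            tried->push_back(d.name);
        if (d.matches(header))
            return d.name;
    }
    return std::string();
}
```

	With a zip-based detector at priority 10 and an ebook one at 50, a
header starting with "PK" never reaches the ebook sniffing at all.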

	ATB,

		Michael.

-- 
 michael.meeks at collabora.com  <><, Pseudo Engineer, itinerant idiot