[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

Mon Jan 31 12:36:51 PST 2011

Hi Michael

On 1 February 2011 01:17, Michael Meeks <michael.meeks at novell.com> wrote:
> Hi Steve,

>        Sure - so; in response to user input I suspect we can take a second to
> parse the thesaurus; we have around 20Mb of text to load for en_US;
> perhaps 32Mb is a reasonable upper-bound; it does seem a lot to parse so
> quickly.

Where it will hurt is if it is not in cache and the user has some
background task running that hits the disk.

An example might be on Windows with virus scanning (or viruses :) ).

>        Right. I think we could easily serialize a small skip-list to disk too
> - if we simply store ~8 or ~32 or so indexes into the data - we can
> parse only a fraction of it, and pop that in our home directory. We
> could also drop the MyThes code too as a dependency to manage.

I'm not sure what you mean by a skip list unless you simply mean a
similar file to the existing .idx, or just a list of offsets for where
the words are to skip loading the whole file.  The trouble with that
approach is the readahead will likely pull in the whole file anyway as
the words aren't generally _that_ far apart in it, so you'll still do
all the IO and just skip a bit of the CPU time.
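
To make the "list of offsets" reading concrete, here is a minimal sketch of a sparse index: remember every Nth word and its byte offset, then scan only the block between two samples. The names (`Sample`, `buildSparseIndex`, `lookup`) and the one-line `word|synonyms` layout are assumptions for illustration only; this is not the real MyThes .dat format, and it works on an in-memory string rather than a file. It also assumes the lines are sorted in plain byte order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sample point: a word and the byte offset of its line.
struct Sample {
    std::string word;
    std::size_t offset;
};

// Record the word and offset of every `stride`-th line of `data`.
std::vector<Sample> buildSparseIndex(const std::string& data, std::size_t stride) {
    assert(stride > 0);
    std::vector<Sample> samples;
    std::size_t lineNo = 0;
    for (std::size_t pos = 0; pos < data.size(); ++lineNo) {
        std::size_t eol = data.find('\n', pos);
        if (eol == std::string::npos) eol = data.size();
        if (lineNo % stride == 0) {
            std::size_t bar = data.find('|', pos);
            if (bar == std::string::npos || bar > eol) bar = eol;
            samples.push_back({data.substr(pos, bar - pos), pos});
        }
        pos = eol + 1;
    }
    return samples;
}

// Return the text after '|' for `word`, or an empty string if it is absent.
std::string lookup(const std::string& data, const std::vector<Sample>& samples,
                   const std::string& word) {
    // First sample strictly greater than the word; the block to scan
    // starts at the sample just before it.
    auto it = std::upper_bound(
        samples.begin(), samples.end(), word,
        [](const std::string& key, const Sample& s) { return key < s.word; });
    if (it == samples.begin()) return std::string();
    std::size_t pos = (it - 1)->offset;
    const std::size_t stop = (it == samples.end()) ? data.size() : it->offset;
    while (pos < stop) {
        std::size_t eol = data.find('\n', pos);
        if (eol == std::string::npos) eol = data.size();
        const std::size_t bar = data.find('|', pos);
        if (bar != std::string::npos && bar < eol &&
            data.compare(pos, bar - pos, word) == 0) {
            return data.substr(bar + 1, eol - bar - 1);
        }
        pos = eol + 1;
    }
    return std::string();
}
```

This only cuts down the parsing work. With a real file the readahead concern above still applies, since the blocks sit close together on disk.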

>
>        The code using it is in:
>
>        lingucomponent/source/thesaurus/libnth/nthesimp.cxx
>
>> BTW, if I did that I'd probably do some major surgery on mythes and
>> just use STL because it basically is doing C style memory management
>> and processing and I think I would screw it up if I started messing
>> with it.  The only problem with simplifying it with STL constructs is
>> that I would want to change the interface (string vs char *), maybe
>> use STL vectors for the list of synonyms, etc.
>
>        Heh; sure.

I've cooled off on this a bit, as performance is slower when using lots
of strings etc.  I was also able to change the idx loading to treat the
file as one big buffer, which sped it up considerably.  This did
mean resorting to lots of pointer tomfoolery, but it is easy to clean up
as there are only 3 allocations instead of 100k+.
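
A rough sketch of the big-buffer idea, not the actual patch: keep the whole idx in one buffer, split it in place by overwriting separators with NULs, and store positions in one pre-sized array. It assumes the idx layout is an encoding line, an entry-count line, then sorted `word|offset` lines; the type and function names (`ThesIndex`, `IdxEntry`, `parseIndex`, `findWord`) are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

struct IdxEntry {
    std::size_t wordPos;   // start of the NUL-terminated word inside buffer
    unsigned long datPos;  // byte offset of the entry in the .dat file
};

struct ThesIndex {
    std::vector<char> buffer;       // the whole idx text, split in place
    std::vector<IdxEntry> entries;  // reserved once from the count line
    std::size_t encodingPos = 0;

    const char* word(std::size_t i) const { return buffer.data() + entries[i].wordPos; }
    const char* encoding() const { return buffer.data() + encodingPos; }
};

// Terminate the next line in place and return its start, or npos at the end.
static std::size_t nextLine(std::vector<char>& buf, std::size_t& cur) {
    assert(!buf.empty());
    const std::size_t end = buf.size() - 1;  // last byte is our own NUL
    if (cur >= end) return std::string::npos;
    const std::size_t line = cur;
    char* nl = static_cast<char*>(std::memchr(&buf[cur], '\n', end - cur));
    if (nl) {
        *nl = '\0';
        cur = static_cast<std::size_t>(nl - buf.data()) + 1;
    } else {
        cur = end;
    }
    return line;
}

ThesIndex parseIndex(const std::string& text) {
    ThesIndex idx;
    idx.buffer.reserve(text.size() + 1);  // allocation 1: the buffer
    idx.buffer.assign(text.begin(), text.end());
    idx.buffer.push_back('\0');
    std::size_t cur = 0;
    const std::size_t enc = nextLine(idx.buffer, cur);
    const std::size_t cnt = nextLine(idx.buffer, cur);
    if (enc == std::string::npos || cnt == std::string::npos) return idx;
    idx.encodingPos = enc;
    // Clamp the declared count so a corrupt header cannot over-reserve.
    const std::size_t declared = std::strtoul(&idx.buffer[cnt], nullptr, 10);
    idx.entries.reserve(std::min(declared, idx.buffer.size()));  // allocation 2
    for (std::size_t line; (line = nextLine(idx.buffer, cur)) != std::string::npos;) {
        char* bar = std::strchr(&idx.buffer[line], '|');
        if (!bar) continue;  // skip malformed lines
        *bar = '\0';
        idx.entries.push_back({line, std::strtoul(bar + 1, nullptr, 10)});
    }
    return idx;
}

// Binary search; assumes the entries are sorted in strcmp order.
const IdxEntry* findWord(const ThesIndex& idx, const char* w) {
    auto it = std::lower_bound(
        idx.entries.begin(), idx.entries.end(), w,
        [&idx](const IdxEntry& e, const char* key) {
            return std::strcmp(idx.buffer.data() + e.wordPos, key) < 0;
        });
    if (it == idx.entries.end() ||
        std::strcmp(idx.buffer.data() + it->wordPos, w) != 0) {
        return nullptr;
    }
    return &*it;
}
```

Storing positions rather than raw pointers keeps the struct safe to copy or move, at the cost of one addition per access.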

>        I guess we could re-write it inside lingucomponent then (?) but we
> should prolly get a better understanding of how frequently this code is
> called first - is it hooked into from the spell checking code ? or is it
> really just the Tools->Language->Thesaurus ?

It's actually hooked into the right click menu (probably amongst other
things).  The first time you right click on a word, the dictionary for
the current locale is loaded before the right click menu shows up.
After that, it uses the cached thesaurus dictionary for subsequent
lookups.
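
The load-on-first-use behaviour can be pictured roughly like this. It is only an illustration of the pattern, not the code in nthesimp.cxx; `ThesaurusCache`, `Dictionary` and the injected loader are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Synonyms = std::vector<std::string>;
using Dictionary = std::map<std::string, Synonyms>;

class ThesaurusCache {
public:
    // Hypothetical loader: reads and parses the thesaurus for one locale.
    using Loader = std::function<Dictionary(const std::string& locale)>;

    explicit ThesaurusCache(Loader loader) : loader_(std::move(loader)) {
        assert(static_cast<bool>(loader_));
    }

    // The first lookup for a locale pays the load cost; later ones hit the cache.
    const Synonyms* lookup(const std::string& locale, const std::string& word) {
        auto it = cache_.find(locale);
        if (it == cache_.end()) {
            it = cache_.emplace(locale, loader_(locale)).first;
        }
        const auto hit = it->second.find(word);
        return hit == it->second.end() ? nullptr : &hit->second;
    }

private:
    Loader loader_;
    std::map<std::string, Dictionary> cache_;
};
```

In this shape the slow path runs once per locale, which is why only the first right click would feel the delay.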

If you look in your right-click menu, you'll notice a thesaurus list
of synonyms shows up (assuming the word is found) :).

Regards,
Steven Butler