[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

Michael Meeks michael.meeks at novell.com
Mon Jan 31 07:17:47 PST 2011


Hi Steve,

On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:
> I haven't had a look at this yet as I thought getting a script to
> analyze the existing thesaurus files would be helpful to get those
> errors looked at.

	Nice work with that :-)

> I thought I would discuss your idea about not using the index at all
> to see what reception it gets, but I think you may also have been
> suggesting a similar thing: are the index files even useful on modern gear?

	I suspect the index files are mostly useless (personally).

> I can populate the en_US index in memory from the .dat file with the
> C++ code in 0.287 s after dropping all cache, and 0.188s when the
> cache is hot.

	Sure - so; in response to user input I suspect we can take a second to
parse the thesaurus; we have around 20MB of text to load for en_US;
perhaps 32MB is a reasonable upper bound; it does seem a lot to parse so
quickly.
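
	A minimal sketch of the in-memory approach, for illustration only. It
assumes the usual .dat layout (an encoding line, then for each entry a
"word|N" line followed by N meaning lines) and a seekable stream;
build_index is a made-up name, not an existing MyThes function.

```cpp
#include <cassert>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MyThes): maps each headword to the byte
// offset of its "word|N" line, so meanings can be read later on demand.
// Assumed layout: line 1 is the encoding, then "word|N" + N meaning lines.
std::map<std::string, std::streamoff> build_index(std::istream& dat)
{
    std::map<std::string, std::streamoff> index;
    std::string line;

    if (!std::getline(dat, line))          // skip the encoding line
        return index;

    for (;;) {
        std::streamoff offset = dat.tellg();   // position of the entry header
        if (!std::getline(dat, line))
            break;                             // end of data
        assert(offset >= 0);                   // the stream must be seekable

        std::string::size_type bar = line.find('|');
        if (bar == std::string::npos)
            continue;                          // not an entry header, ignore it

        index[line.substr(0, bar)] = offset;

        // Skip the N meaning lines that belong to this entry.
        long meanings = std::strtol(line.c_str() + bar + 1, nullptr, 10);
        for (long i = 0; i < meanings && std::getline(dat, line); ++i) {
        }
    }
    return index;
}
```

	A caller would open the .dat file in binary mode (so that offsets are
plain byte positions), call build_index once when the thesaurus dialog is
first opened, and later seek to index[word] to read the meanings.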

> I do admit that my desktop is pretty quick though, with 4 cores, SATA
> II drives etc.

	Sure - but it will only use one of these ;-)

> If the thesaurus is only loaded when the user pops it up, then
> couldn't mythes be taught to generate its own in-memory index
> from the dictionary and not bother with an index file at all?

	Right. I think we could easily serialize a small skip-list to disk too
- if we simply store ~8 or ~32 or so indexes into the data - we can
parse only a fraction of it, and pop that in our home directory. We
could also drop the MyThes code as a dependency to manage.

	The code using it is in:

	lingucomponent/source/thesaurus/libnth/nthesimp.cxx
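
	To make the small skip-list idea above concrete, a rough sketch under
stated assumptions: the .dat entries are sorted by headword in plain byte
order, the layout is "word|N" followed by N meaning lines, and all the
names here (Checkpoint, build_sparse_index, find_entry) are invented for
the example. Writing the checkpoints out to a file is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// (headword, byte offset of its "word|N" line). Hypothetical type.
typedef std::pair<std::string, std::streamoff> Checkpoint;

// Read one entry header and skip its meaning lines. False at end of data.
static bool read_entry(std::istream& dat, std::string& word)
{
    std::string line;
    if (!std::getline(dat, line))
        return false;
    std::string::size_type bar = line.find('|');
    word = line.substr(0, bar);
    long meanings = (bar == std::string::npos)
        ? 0 : std::strtol(line.c_str() + bar + 1, nullptr, 10);
    for (long i = 0; i < meanings && std::getline(dat, line); ++i) {
    }
    return true;
}

// Keep only every stride-th entry; this small vector is what could be
// written to disk instead of a full index.
std::vector<Checkpoint> build_sparse_index(std::istream& dat, std::size_t stride)
{
    assert(stride > 0);
    std::vector<Checkpoint> sparse;
    std::string word;
    std::getline(dat, word);                   // skip the encoding line
    for (std::size_t n = 0; ; ++n) {
        std::streamoff offset = dat.tellg();
        if (!read_entry(dat, word))
            break;
        if (n % stride == 0)
            sparse.push_back(Checkpoint(word, offset));
    }
    return sparse;
}

// Jump to the last checkpoint <= target, then scan forward from there.
// Returns the byte offset of the entry, or -1 if the word is absent.
std::streamoff find_entry(std::istream& dat,
                          const std::vector<Checkpoint>& sparse,
                          const std::string& target)
{
    std::vector<Checkpoint>::const_iterator it = std::upper_bound(
        sparse.begin(), sparse.end(), target,
        [](const std::string& t, const Checkpoint& c) { return t < c.first; });
    if (it == sparse.begin())
        return -1;                             // sorts before the first entry
    --it;

    dat.clear();                               // drop any earlier EOF state
    dat.seekg(it->second, std::ios::beg);
    std::string word;
    for (;;) {
        std::streamoff offset = dat.tellg();
        if (!read_entry(dat, word) || word > target)
            return -1;                         // ran past where it would be
        if (word == target)
            return offset;
    }
}
```

	With a stride chosen so that only a few dozen checkpoints are kept, a
lookup parses at most one stride's worth of entries rather than the whole
file.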

> BTW, if I did that I'd probably do some major surgery on mythes and
> just use STL because it basically is doing C style memory management
> and processing and I think I would screw it up if I started messing
> with it.  The only problem with simplifying it with STL constructs is
> that I would want to change the interface (string vs char *), maybe
> use STL vectors for the list of synonyms, etc.

	Heh; sure.
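
	For illustration, the kind of STL-based interface being described might
look roughly like this. Meaning and parse_meaning are hypothetical names,
not the existing mythes API, and the "(pos)|syn1|syn2" line layout is an
assumption about the .dat format.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One meaning of a headword, held in STL containers rather than char*.
struct Meaning {
    std::string description;             // first field, e.g. "(adj)"
    std::vector<std::string> synonyms;   // remaining '|'-separated fields
};

// Parse a single meaning line of the assumed form "(pos)|syn1|syn2|...".
Meaning parse_meaning(const std::string& line)
{
    assert(line.find('\n') == std::string::npos);   // expects exactly one line

    Meaning m;
    std::istringstream fields(line);
    std::string field;
    if (std::getline(fields, m.description, '|')) {
        while (std::getline(fields, field, '|')) {
            if (!field.empty())
                m.synonyms.push_back(field);
        }
    }
    return m;
}
```

	Memory is released when the Meaning goes out of scope, which removes
the manual cleanup that the C-style interface requires of its callers.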

> By this stage it's not looking much like mythes anymore ...

	I guess we could re-write it inside lingucomponent then (?) but we
should probably get a better understanding of how frequently this code is
called first - is it called from the spell-checking code? Or is it
really just Tools->Language->Thesaurus?

	Thanks !

		Michael.

-- 
 michael.meeks at novell.com  <><, Pseudo Engineer, itinerant idiot


