[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
michael.meeks at novell.com
Mon Jan 31 07:17:47 PST 2011
On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:
> I haven't had a look at this yet as I thought getting a script to
> analyze the existing thesaurus files would be helpful to get those
> errors looked at.
Nice work with that :-)
> I thought I would discuss your idea about not using the index at all
> to see what reception it gets, but I think you may also have been
> suggesting a similar thing: are the index files even useful on modern gear?
I suspect the index files are mostly useless (personally).
> I can populate the en_US index in memory from the .dat file with the
> C++ code in 0.287 s after dropping all cache, and 0.188s when the
> cache is hot.
Sure - so; in response to user input I suspect we can take a second to
parse the thesaurus; we have around 20MB of text to load for en_US;
perhaps 32MB is a reasonable upper bound; it does seem a lot to parse so
> I do admit that my desktop is pretty quick though, with 4 cores, SATA
> II drives etc.
Sure - but it will only use one of these ;-)
> If the thesaurus is only loaded when the user pops it up, then
> couldn't mythes be taught to generate its own in-memory index
> from the dictionary and not bother with an index file at all?
Right. I think we could easily serialize a small skip-list to disk too
- if we simply store ~8 or ~32 or so indexes into the data - we can
parse only a fraction of it, and pop that in our home directory. We
could also drop the MyThes code as a dependency to manage.
The code using it is in:
> BTW, if I did that I'd probably do some major surgery on mythes and
> just use STL because it basically is doing C style memory management
> and processing and I think I would screw it up if I started messing
> with it. The only problem with simplifying it with STL constructs is
> that I would want to change the interface (string vs char *), maybe
> use STL vectors for the list of synonyms, etc.
> By this stage it's not looking much like mythes anymore ...
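For a feel of what the string/vector flavour of the interface might look like, here is a small sketch that parses one meaning line. Meaning and parseMeaning are hypothetical names, and the "(pos)|syn1|syn2|..." line format is an assumption about the .dat layout.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type for illustration: one meaning of a headword, owning
// its strings so callers never free anything by hand.
struct Meaning
{
    std::string definition;            // first field, e.g. "(adj)"
    std::vector<std::string> synonyms; // remaining '|'-separated fields
};

// Split one meaning line of the assumed form "(pos)|syn1|syn2|...".
Meaning parseMeaning(const std::string& line)
{
    Meaning m;
    std::istringstream fields(line);
    std::string field;
    if (std::getline(fields, field, '|'))
        m.definition = field;
    while (std::getline(fields, field, '|'))
        m.synonyms.push_back(field);
    return m;
}
```

A lookup would then return a std::vector<Meaning> by value, which is exactly the kind of interface change (string vs char *) mentioned above.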
I guess we could re-write it inside lingucomponent then (?) but we
should prolly get a better understanding of how frequently this code is
called first - is it hooked into from the spell checking code ? or is it
really just the Tools->Language->Thesaurus ?
michael.meeks at novell.com <><, Pseudo Engineer, itinerant idiot