[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

Sun Jan 30 02:32:18 PST 2011

Hi Michael,

On 29 January 2011 21:45, Steve Butler <sebutler at gmail.com> wrote:

> I thought I would discuss your idea about not using the index at all
> to see what reception it gets, but I think you may also have been
> suggesting a similar thing:
> are the index files even useful on modern gear?
>
> I can populate the en_US index in memory from the .dat file with the
> C++ code in 0.287 s after dropping all cache, and 0.188s when the
> cache is hot.
>
> I do admit that my desktop is pretty quick though, with 4 cores, SATA
> II drives etc.

I have plugged the idxdict.cpp code (modified) into the mythes index
loader and made it load from the .dat file directly.  The index file
is no longer touched.
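
For anyone who wants to picture the approach without opening the attachments, here is a minimal sketch of building the word-to-offset index straight from a .dat stream. It is not the attached code. The function name buildIndex is made up for illustration, and the layout it assumes (first line is the encoding name, then each entry is a "word|N" line followed by N meaning lines) should be checked against the real th_*.dat files before relying on it.

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>   // handy for trying this on an in-memory sample
#include <string>

// Hypothetical helper, not the attached mythes.cxx: scan a MyThes-style
// .dat stream and record the offset at which each entry line starts.
// Assumed layout: line 1 holds the encoding name; every entry is a line
// "word|N" followed by N lines of meanings.
std::map<std::string, std::streamoff> buildIndex(std::istream& dat,
                                                 std::string& encoding)
{
    std::map<std::string, std::streamoff> index;
    std::getline(dat, encoding);

    std::string line;
    for (;;) {
        // Remember where the next line starts before consuming it.
        const std::streamoff offset = dat.tellg();
        if (!std::getline(dat, line))
            break;
        const std::string::size_type bar = line.find('|');
        if (bar == std::string::npos)
            continue;                       // skip lines that are not entries
        index[line.substr(0, bar)] = offset;

        // Skip the meaning lines that belong to this entry.
        const int meanings = std::atoi(line.c_str() + bar + 1);
        for (int i = 0; i < meanings && std::getline(dat, line); ++i) {
        }
    }
    return index;
}
```

Opening the real file in binary mode would keep the recorded offsets equal to byte positions on every platform.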

Here are some comparison timings on the above system (measured with
gettimeofday() on either side of the call in swriter).

USING AN INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.
US Thesaurus - hot OS cache
2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.

USING NO INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.
US Thesaurus - hot OS cache
2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.

As these numbers show, loading the US thesaurus without the index file
is roughly 2.6x slower with a cold cache and 2.9x slower with a hot
cache, which is about 0.16 s and 0.09 s extra in absolute terms.
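
In case anyone wants to repeat the measurement, this is a rough sketch of timing a call with gettimeofday() on either side of it. The helper names elapsedSeconds and timeCall are invented here and are not part of mythes or swriter. Note that gettimeofday() reports wall-clock time, so a clock adjustment during the call would skew the result.

```cpp
#include <sys/time.h>

// Seconds between two gettimeofday() samples. tv_usec may be smaller in
// the later sample, so the microsecond difference is allowed to go negative.
double elapsedSeconds(const timeval& start, const timeval& end)
{
    return static_cast<double>(end.tv_sec - start.tv_sec)
         + static_cast<double>(end.tv_usec - start.tv_usec) / 1e6;
}

// Hypothetical wrapper: sample the clock, run the callable, sample again.
template <typename Fn>
double timeCall(Fn fn)
{
    timeval start, end;
    gettimeofday(&start, 0);
    fn();
    gettimeofday(&end, 0);
    return elapsedSeconds(start, end);
}
```

Dropping the OS cache between runs is what separates the cold and hot figures; the sketch only covers the stopwatch part.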

> BTW, if I did that I'd probably do some major surgery on mythes and
> just use STL because it basically is doing C style memory management
> and processing and I think I would screw it up if I started messing
> with it.  The only problem with simplifying it with STL constructs is
> that I would want to change the interface (string vs char *), maybe
> use STL vectors for the list of synonyms, etc.

I've kept the public interface of mythes the same with my changes (but
the index file name in the constructor is ignored), apart from this
one:
const char* get_th_encoding();

I didn't change the mentry struct or the code that reads an entry from
the dat file at all.  The offset is looked up by word in the std::map,
and from there the existing mythes C style code takes over.

It might be possible to make the index creation run quicker by
avoiding the use of so many std::strings, but I probably wouldn't do
this as it would make the code harder to understand.

I did remove some private member functions that were no longer needed,
and some private data is now using std::string and std::map (as
per idxdict).

Now, assuming anyone thinks this is a good idea and the tradeoff of
initial lookup speed vs installation size is appropriate, I would
appreciate pointers on how we would go about packaging up such a
change, given that it is confined entirely to 3rd party source.
Naturally, if this approach were selected, the steps that build the
.idx files and add them to the language pack zips would need to be
removed.  A further option could be to have it use idx files if they
exist, but fall back to using only the .dat files.
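
The "use idx if present, otherwise fall back to dat" option could look roughly like the sketch below. None of this is in the attached files. The names ThesIndex, readIdx, scanDat and loadIndex are invented, and both file layouts are assumptions: the .idx is taken to be an encoding line, an entry-count line and then "word|offset" lines, and the .dat an encoding line followed by "word|N" entries with N meaning lines each.

```cpp
#include <cstdlib>
#include <fstream>
#include <map>
#include <string>

typedef std::map<std::string, std::streamoff> ThesIndex;

// Assumed .idx layout: encoding line, count line, then "word|offset".
static void readIdx(std::istream& idx, ThesIndex& out)
{
    std::string line;
    std::getline(idx, line);                // encoding name
    std::getline(idx, line);                // entry count, unused with a map
    while (std::getline(idx, line)) {
        const std::string::size_type bar = line.find('|');
        if (bar != std::string::npos)
            out[line.substr(0, bar)] = std::atol(line.c_str() + bar + 1);
    }
}

// Assumed .dat layout: encoding line, then "word|N" plus N meaning lines.
static void scanDat(std::istream& dat, ThesIndex& out)
{
    std::string line;
    std::getline(dat, line);                // encoding name
    for (;;) {
        const std::streamoff offset = dat.tellg();
        if (!std::getline(dat, line))
            break;
        const std::string::size_type bar = line.find('|');
        if (bar == std::string::npos)
            continue;
        out[line.substr(0, bar)] = offset;
        const int meanings = std::atoi(line.c_str() + bar + 1);
        for (int i = 0; i < meanings && std::getline(dat, line); ++i) {
        }
    }
}

// Returns true when the precomputed .idx was used, false when the index
// had to be rebuilt from the .dat file (or neither file could be opened).
bool loadIndex(const std::string& idxPath, const std::string& datPath,
               ThesIndex& out)
{
    std::ifstream idx(idxPath.c_str());
    if (idx) {
        readIdx(idx, out);
        return true;
    }
    std::ifstream dat(datPath.c_str(), std::ios::binary);
    if (dat)
        scanDat(dat, out);
    return false;
}
```

With something like this, language packs that still ship an .idx keep the faster load, and packs without one pay the scan cost once per load.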

Changes are LGPLv3+/MPL licensed.  I've attached the two altered files
here in case anyone wants to have a look and provide feedback on the
approach.

As this is simply a proof of concept for the timing, I haven't tested
for memory leaks or data corruption yet.

I'm also not sure how to format it as the original code is not well formatted.

Regards,
Steven Butler
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mythes.hxx
Type: text/x-c++hdr
Size: 1660 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20110130/b5d4611d/attachment-0001.hxx>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mythes.cxx
Type: text/x-c++src
Size: 8455 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20110130/b5d4611d/attachment-0001.cxx>