[Libreoffice] RC4 / Windows size analysis ...

Thu Jan 27 10:04:43 PST 2011

Hi Steven,

On Wed, 2011-01-26 at 15:17 +1000, Steven Butler wrote:
> > > One idea: can we generate the thesaurus idx file during install? That may
> > > save a few megabytes.
..
> I have had an attempt at this - code attached, it is dual licensed under
> LGPL / MIT although there are no (c) headers in the file (feel free to add
> some).

	Wow - great work :-) I've just pushed this to dictionaries/source in
master, and compiled it there. Still need some tweaks to get it called
in the various dictionaries/ makefiles I suppose - but it is a great
start thanks !

	Licensing wise - I'd like to add the standard LGPLv3+/MPL header to it
(see bootstrap/) but having MIT too is fine if you want.

	I was going to add it as an easy hack, but you beat me to it :-)

> I have no idea how this would be integrated into the build process as I'm
> not even sure where it is called from, but happy if someone wants to
> take up the challenge and/or incorporate it as an installer process.

	So - the installer process is more exciting on Windows I think - we'll
need to see how the setup_native/ tools are called and be inspired by
that.

> Here's timing of the CPP version on a Core i5 amd64 generating the
> following indices:
..
> The same set of files using th_gen_idx.pl took around 5 seconds (although
> some basic fixups got it down to 3.5 seconds).

	Great - it's trivial; indeed - it rather makes you wonder whether we
need the indexes at all ? [ I wonder what they are good for, and/or what
code loads and uses them ;-]. We may discover that in fact there is no
need for them to be indexed - any chance of a dig around ?
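
	For anyone digging: a minimal Python sketch of what the index seems
to hold. It assumes the usual MyThes-style layout, which is worth checking
against the real files: the first line of the .dat names the encoding, each
entry is a 'word|N' header followed by N meaning lines, and the .idx holds
the encoding, the entry count, then sorted 'word|byte-offset' lines. The
names build_index and lookup are illustrative only; they are not taken from
th_gen_idx.pl or the attached C++ tool.

```python
import io


def build_index(dat_bytes):
    """Return index bytes for a thesaurus .dat file held in memory.

    Assumed layout: line 1 is the encoding name; every entry is a
    'word|N' header line followed by exactly N meaning lines.
    """
    stream = io.BytesIO(dat_bytes)
    encoding = stream.readline().rstrip(b"\r\n")
    entries = []
    while True:
        offset = stream.tell()          # byte offset of the header line
        header = stream.readline()
        if not header:
            break
        word, _, count = header.rstrip(b"\r\n").rpartition(b"|")
        entries.append((word, offset))
        for _ in range(int(count)):     # trust the declared count
            stream.readline()
    entries.sort()                      # by word, then by offset
    lines = [encoding, str(len(entries)).encode()]
    lines += [word + b"|" + str(offset).encode() for word, offset in entries]
    return b"\n".join(lines) + b"\n"


def lookup(dat_bytes, idx_bytes, word):
    """Return the meaning lines for `word`, or None if it is not indexed."""
    for line in idx_bytes.splitlines()[2:]:   # skip encoding and count
        key, _, offset = line.rpartition(b"|")
        if key == word:
            stream = io.BytesIO(dat_bytes)
            stream.seek(int(offset))
            header = stream.readline().rstrip(b"\r\n")
            count = int(header.rpartition(b"|")[2])
            return [stream.readline().rstrip(b"\r\n") for _ in range(count)]
    return None
```

	If that layout is right, all the index buys is a seek straight to an
entry instead of a scan of the .dat, which is the trade-off to weigh
against building it at install or load time. Note the sketch trusts the
declared counts, so a wrong count makes it skip or misread entries.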

> What I have noticed while testing the change was that a lot of the
> dictionaries I processed have errors.

	Nasty.

> These range from having the entry count incorrect, causing the index
> process to miss a word (lots of these in some dictionaries), to having
> words apparently duplicated either as the next entry, or sometimes a long
> way apart.

	That is bad; we should mail the l10n list to ask them to have a look I
suppose.

> I have not attempted to fix these dictionary issues, but if they are
> serious it might be worth having a perl script that is able to validate
> the dictionaries are internally consistent.  Unfortunately, it would have
> to use heuristics as the file format makes it difficult to tell in general
> what kind of line is being processed.

	Right; we should validate them as we compile the index perhaps - or at
least, look at the parser and see how it has traditionally interpreted
them.
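
	A rough sketch of what such a check could look like, in Python for
brevity. It leans on one heuristic, which is an assumption and will
misfire on some data: a line is treated as an entry header only if it has
the shape 'word|digits' with a single '|'. Everything else is counted as a
meaning line. check_thesaurus is a hypothetical name.

```python
import re

# Heuristic (assumption): headers look like 'word|N' with exactly one '|'.
HEADER = re.compile(r"^([^|]+)\|(\d+)$")


def check_thesaurus(text):
    """Return a list of (line_number, message) for suspicious entries."""
    problems = []
    first_seen = {}                 # word -> line number of first header
    word = declared = start = None
    actual = 0

    def close_entry():
        # Compare the declared meaning count with what was actually found.
        if word is not None and actual != declared:
            problems.append(
                (start, f"{word!r}: declares {declared} meanings, found {actual}")
            )

    # Line 1 is assumed to be the encoding, so entries start at line 2.
    for number, line in enumerate(text.splitlines()[1:], start=2):
        match = HEADER.match(line)
        if match:
            close_entry()
            word, declared, start, actual = match[1], int(match[2]), number, 0
            if word in first_seen:
                problems.append(
                    (number, f"{word!r}: duplicate of line {first_seen[word]}")
                )
            else:
                first_seen[word] = number
        else:
            actual += 1
    close_entry()
    return problems
```

	This catches both problems described above (wrong entry counts, and
duplicates whether adjacent or far apart), but a meaning line that happens
to look like 'text|42' would be misread as a header.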

> The CPP version attached differs from the perl script in that when
> multiple entries are found for a word, they appear to come out in reverse
> order compared to the original perl script.  What I'm curious about is
> what impact having multiple entries for a word has when loaded into
> LibreOffice?

	Me too ;-)
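
	One way to make the order of duplicates deterministic, whichever
order turns out to matter: sort on the word and break ties on the byte
offset. A small Python sketch; order_entries and its newest_first flag are
made up for illustration, and nothing here is known about how the perl or
C++ versions actually sort.

```python
def order_entries(entries, newest_first=False):
    """Sort (word, offset) pairs by word, breaking ties on the offset.

    With newest_first=False duplicates keep file order; with True the
    entry that appears later in the file comes first.
    """
    if newest_first:
        return sorted(entries, key=lambda e: (e[0], -e[1]))
    return sorted(entries, key=lambda e: (e[0], e[1]))
```

	Pinning the tie-break would at least make the two tools comparable
with a plain diff of their output.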

> For reference I have attached an improved perl version of the perl script
> that runs a couple of seconds faster than the original.  I had three to
> four versions in my tree but changing none of them triggered a git diff to
> show the changes so I've attached the full copy.

	The native code thing is great; it'd be wonderful if you had some time
to look at hooking it into the build process in dictionaries/ (?)

	Thanks muchly !

		Michael.

-- 
 michael.meeks at novell.com  <><, Pseudo Engineer, itinerant idiot