[Libreoffice] RC4 / Windows size analysis ...

Steven Butler stevenb at kjross.com.au
Tue Jan 25 21:17:26 PST 2011


> > One idea, can we generate thesaurus idx file during install? That may
> > solve few megabytes.

> 	Oh - right; 4Mb of that - which we can (I assume easily) build at
> install time; I've added that to the spreadsheet, and re-up-loaded it.
> It should be quite fun in fact to re-write the somewhat trivial
> dictionaries/util/th_gen_idx.pl script as a standalone C++ tool - would
> be faster too: it takes ~5 CPU seconds each to index those beasties in
> perl, which would be ~instant in C++.

I have had an attempt at this; the code is attached. It is dual-licensed
under LGPL / MIT, although there are no (c) headers in the file (feel free
to add some).

I have no idea how this would be integrated into the build process, as I'm
not even sure where it is called from, but I'm happy if someone wants to
take up the challenge and/or incorporate it as an installer step.

Here's the timing of the C++ version on a Core i5 (amd64) generating the
following indices:

libo/clone/libs-extern-sys/dictionaries/ca/th_ca_ES_v3.dat.idx2
libo/clone/libs-extern-sys/dictionaries/cs_CZ/th_cs_CZ_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/da_DK/th_da_DK.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_AT/th_de_AT_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_CH/th_de_CH_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/de_DE/th_de_DE_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/en/th_en_US_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/fr_FR/thes_fr.dat.idx2
libo/clone/libs-extern-sys/dictionaries/hu_HU/th_hu_HU_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/it_IT/th_it_IT_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ne_NP/th_ne_NP_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/no/th_nb_NO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/no/th_nn_NO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/pl_PL/th_pl_PL_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ro/th_ro_RO_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/ru_RU/th_ru_RU_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/sk_SK/th_sk_SK_v2.dat.idx2
libo/clone/libs-extern-sys/dictionaries/sl_SI/th_sl_SI_v2.dat.idx2

real    0m0.792s
user    0m0.630s
sys     0m0.080s

The same set of files using th_gen_idx.pl took around 5 seconds (although
some basic fixups got it down to 3.5 seconds).

What I noticed while testing the change is that a lot of the dictionaries
I processed contain errors.

These range from an incorrect entry count, which causes the indexing
process to miss a word (there are lots of these in some dictionaries), to
words apparently duplicated, either as the next entry or sometimes a long
way apart.

I have not attempted to fix these dictionary issues, but if they are
serious it might be worth having a perl script that validates that the
dictionaries are internally consistent.  Unfortunately, it would have to
use heuristics, as the file format makes it difficult to tell in general
what kind of line is being processed.
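As a rough illustration, here is the kind of heuristic check I have in
mind, sketched in C++ rather than perl.  The function name
countSuspectRecords and the "entry lines start with '(' or '-'" rule are
my own assumptions, not anything the format guarantees, so expect false
positives on unusual dictionaries:

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Heuristic consistency check for a thesaurus .dat stream: after a
// "word|count" line there should be exactly `count` entry lines, and the
// line following those should again parse as "word|count".  Returns the
// number of suspect lines found.  This is a sketch only - the format has
// no unambiguous marker distinguishing word lines from entry lines.
static int countSuspectRecords(std::istream& in)
{
	std::string line;
	std::getline(in, line);            // first line is the encoding, e.g. "UTF-8"
	int suspect = 0;
	while (std::getline(in, line))
	{
		// A word line should end in "|<number>"
		std::string::size_type bar = line.rfind('|');
		char* end = 0;
		long count = (bar == std::string::npos)
			? -1 : std::strtol(line.c_str() + bar + 1, &end, 10);
		if (bar == std::string::npos || count < 0 || *end != '\0')
		{
			++suspect;                 // expected a word line, got something else
			continue;                  // resynchronise on the next line
		}
		for (long i = 0; i < count && std::getline(in, line); ++i)
		{
			// Entry lines normally start with '(' or '-'; anything else
			// suggests the declared count overran into the next word.
			if (line.empty() || (line[0] != '(' && line[0] != '-'))
				++suspect;
		}
	}
	return suspect;
}
```

A wrong entry count typically trips both checks (the overrun line and the
orphaned entry after it), so the count is a lower bound on the number of
clean records, not an exact error tally.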

The attached C++ version differs from the perl script in one respect: when
multiple entries are found for the same word, they come out in the reverse
of the order the perl script produces.  What I'm curious about is what
impact having multiple entries for a word has when the index is loaded
into LibreOffice.
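My guess is that the reversal comes from the hinted insert: since C++11,
multimap::insert(hint, value) places the new element as close as possible
to the position just prior to the hint, so when the hint is the iterator
returned for the previous duplicate key, each new duplicate lands in front
of the old one.  A small demonstration of the two behaviours (assuming
C++11 semantics; before C++11 the relative order of equal keys was
unspecified):

```cpp
#include <map>
#include <string>
#include <utility>

// Plain insert: since C++11, equal keys keep their insertion order
// (each new equal element is placed at the upper bound).
static int firstValuePlain()
{
	std::multimap<std::string, int> m;
	m.insert(std::make_pair(std::string("word"), 1));
	m.insert(std::make_pair(std::string("word"), 2));
	return m.begin()->second;          // 1: first-inserted comes first
}

// Hinted insert, as the indexer does: the hint is the iterator returned
// by the previous insert, and the new element is placed just prior to the
// hint, so duplicate words come out in reverse order of insertion.
static int firstValueHinted()
{
	std::multimap<std::string, int> m;
	std::multimap<std::string, int>::iterator ret(m.begin());
	ret = m.insert(ret, std::make_pair(std::string("word"), 1));
	ret = m.insert(ret, std::make_pair(std::string("word"), 2));
	return m.begin()->second;          // 2: last-inserted comes first
}
```

If the original order matters to the thesaurus lookup, dropping the hint
(or hinting with entries.end() instead) would restore it at a small cost
in insertion speed.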

For reference I have attached an improved version of the perl script that
runs a couple of seconds faster than the original.  I had three or four
versions in my tree, but git diff would not show the changes for any of
them, so I've attached the full copy rather than a patch.

Cheers
Steve.
-------------- next part --------------

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <stdlib.h>
#include <string.h>

static const int MAXLINE = 1024*64;

using namespace std;

int main(int argc, char *argv[])
{
	if (argc != 3 || strcmp(argv[1], "-o") != 0)
	{
		cerr << "Usage: th_gen_idx -o outputfile < input\n";
		::exit(99);
	}
	// Detach cin from C stdio; this improves performance by approx 5x
	cin.sync_with_stdio(false);

	const char * outputFile(argv[2]);
	char inputBuffer[MAXLINE];
	multimap<string, size_t> entries;
	multimap<string, size_t>::iterator ret(entries.begin());

	// The first line of the .dat file is the character encoding
	cin.getline(inputBuffer, MAXLINE);
	const string encoding(inputBuffer);
	size_t currentOffset(encoding.size()+1);
	while (true)
	{
		// Extract the next word, but not the entry count
		cin.getline(inputBuffer, MAXLINE, '|');

		if (cin.eof()) break;

		string word(inputBuffer);
		ret = entries.insert(ret, pair<string, size_t>(word, currentOffset));
		currentOffset += word.size() + 1;
		// The rest of the line is the entry count
		cin.getline(inputBuffer, MAXLINE);
		if (!cin.good())
		{
			cerr << "Unable to read entry - insufficient buffer?\n";
			::exit(99);
		}
		currentOffset += strlen(inputBuffer)+1;
		int entryCount(strtol(inputBuffer, NULL, 10));
		// Skip the entry lines; only their length matters for the offset
		for (int i(0); i < entryCount; ++i)
		{
			cin.getline(inputBuffer, MAXLINE);
			currentOffset += strlen(inputBuffer)+1;
		}
	}

	// Use binary mode to prevent any translation of LF to CRLF on Windows
	ofstream outputStream(outputFile, ios_base::binary| ios_base::trunc|ios_base::out);
	if (!outputStream.is_open())
	{
		cerr << "Unable to open output file " << outputFile << endl;
		::exit(99);
	}

	cout << outputFile << endl;

	outputStream << encoding << '\n' << entries.size() << '\n';

	for (multimap<string, size_t>::const_iterator ii(entries.begin());
		ii != entries.end();
		++ii
	)
	{
		outputStream << ii->first << '|' << ii->second << '\n';
	}

	return 0;
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: th_gen_idx.pl
Type: application/octet-stream
Size: 2933 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20110126/1e4dc7b7/attachment-0001.obj>

