[Grammar checking] Using LanguageTool lexicons with Lightproof now possible

Németh László nemeth at numbertext.org
Wed Dec 5 02:17:21 PST 2012


Hi Olivier,

Great work! Now it's possible to write a full rule converter for the
existing LanguageTool modules. I will add your library to the lightproof
module with an example/test.

By the way, I have ported the lightproof modules to Python 3.3, but it
seems there is a registration issue with the dictionary packages bundled
with Lightproof components (unfortunately, I couldn't test it yesterday,
because the daily build had a missing-library problem on Ubuntu), so I will
write more soon.

Best regards,
László


2012/12/4 Olivier R. <olivier.noreply at gmail.com>

> My connection dropped while I was posting. Here is the full post:
>
>
> Hello everyone,
>
> ## Build indexable binary grammatically tagged dictionaries for
> Lightproof/Grammalecte ##
>
> The most important limitation for building a grammar checker with
> Lightproof was the lack of grammatically tagged dictionaries. Most
> Hunspell dictionaries, which Lightproof can use via LibreOffice-UNO, are
> not grammatically tagged and are of no help for retrieving morphological
> information about words.
>
> LanguageTool does not have this problem, since it uses indexable binary
> dictionaries built from huge grammatically tagged lexicons with
> finite-state automaton (FSA) software written in C
> (http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html).
> Java has a dedicated library to read these binary files.
>
> But we had nothing like this in Python.
> So I tried to understand how this C FSA software works, but as I am not a
> C expert, and as I was reluctant to depend on yet another piece of
> software, I finally decided to write my own FSA tool to build such
> indexable binary dictionaries.
>
> Why build such dictionaries, you may ask? Because lexicons which contain
> words, lemmas and morphological tags are HUGE, up to several megabytes;
> they are not indexable as is, and making them so consumes much more
> memory. So the goal is to make them small, compressed, quick to load and
> to parse, low in memory consumption, indexable, and readable without
> having to uncompress them.
>
> That’s what I did with Python 3.3.
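>
> To give an idea of the principle, here is a toy sketch (not my actual
> code: a real FSA also shares suffixes and serializes the automaton to a
> compact, directly indexable binary form) showing how prefix sharing
> already shrinks a lexicon while keeping lookups fast:
>
>   # Toy trie: common prefixes of flexions are stored only once.
>   # A real FSA additionally merges common suffixes and is written
>   # to disk as a binary, indexable structure.
>   class TrieNode:
>       __slots__ = ("children", "payload")
>       def __init__(self):
>           self.children = {}    # char -> TrieNode
>           self.payload = None   # "stem\ttags" if a flexion ends here
>
>   def insert(root, flexion, payload):
>       node = root
>       for ch in flexion:
>           node = node.children.setdefault(ch, TrieNode())
>       node.payload = payload
>
>   def lookup(root, flexion):
>       node = root
>       for ch in flexion:
>           node = node.children.get(ch)
>           if node is None:
>               return None       # unknown word
>       return node.payload
>
>   root = TrieNode()
>   insert(root, "walked", "walk\tVBD")
>   insert(root, "walker", "walker\tNN")
>   print(lookup(root, "walked"))   # -> walk, VBD (tab-separated)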
>
> I took all the lexicons from LanguageTool and compressed them into
> indexable binary dictionaries readable with my own script.
> The built dictionaries are not as small as the ones made with the C FSA
> tool used by LT, but they are close enough, and there is still room for
> improvement. I'll work on this later.
>
> Here are the results:
>
>
> These dictionaries are about 5-30 % bigger than the LT ones (and
> sometimes, surprisingly, half the size), but in any case they are
> perfectly usable as is.
>
> Consequences:
> — it will be possible to use all the existing LT lexicons with Lightproof,
> — we will be able to make a stand-alone version of Lightproof/Grammalecte,
> as it won't be necessary to use Hunspell anymore,
> — we will be able to write automated tests and prevent regressions when
> writing/modifying rules.
>
>
> # Lexicons
>
> Lexicons are simple text documents listing all flexions, their stem and
> their morphological tags:
>
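> For example (these entries are only illustrative; each LT lexicon uses
> its own tag set):
>
>   walked	walk	VBD
>   walking	walk	VBG
>   walks	walk	VBZ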
>
>
> Each field is separated by a tab.
>
> With the new tool, lexicons MUST be UTF-8 encoded to be properly converted.
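>
> Reading such a lexicon in Python is then trivial; a quick sketch (the
> file name here is just an example):
>
>   # Parse a tab-separated lexicon; UTF-8 is mandatory for the new tool.
>   with open("english.lexicon.txt", encoding="utf-8") as f:  # example name
>       for line in f:
>           if not line.strip():
>               continue  # skip blank lines
>           flexion, stem, tags = line.rstrip("\n").split("\t")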
>
>
> # Want to test it?
>
> The code is written in Python 3.3. License: MPL 2.
>
> Two files:
> — fsa_builder.py      reads all files listed in "_lexicons.list.txt" and
> builds binary dictionaries with a specific stemming command.
> — fsa_reader.py       reads all files named "[lang].bdic" and, if it
> finds a test file named "[lang].test.txt", writes the results found for
> each word to a new file.
>
> The builder, with the uncompressed LT lexicons encoded in UTF-8:
> http://dicollecte.free.fr/download/fsa1/pyFSA_builder.7z [130 MB]
>
> Type:
>
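> something like this (assuming python3 points to your Python 3.3
> interpreter; the exact invocation may differ):
>
>   python3 fsa_builder.py   # hypothetical invocation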
>
>
> And let it run. Warning: building dictionaries is slow, as lexicons are
> huge. For most languages it takes 1 or 2 minutes each. But for German,
> Polish, Galician, Russian and Czech, it takes 5 to 10 minutes each, and
> it consumes a huge amount of memory. Czech uses up to 6 GB! You have been
> warned. :)
>
> The dictionary reader, with the binary dictionaries and test files:
> http://dicollecte.free.fr/download/fsa1/pyFSA_reader.7z [11 MB]
>
> Type:
>
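> presumably the same kind of invocation (again, hypothetical):
>
>   python3 fsa_reader.py   # hypothetical invocation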
>
>
> Let it run. Count to 1 (or 2 if you have a slow computer). And it’s already
> finished. :)
> It has read all the binary dictionaries, read the test files, and written
> the results to new files.
>
> I’ll try to write a more complete web page about this when I have the time.
> I still have to improve the compression, for those who might think it's
> not small enough.
>
>
> Regards,
> Olivier R.