[Libreoffice] Utility to scan for some faults in Thesaurus files
sebutler at gmail.com
Sat Jan 29 03:03:28 PST 2011
The l10n guys said they needed this utility in git as their list does
not allow attachments. I will leave it up to your wisdom as to where
to put it.
A brief description of what it does:
I have made some assumptions about the file format to look for the
common errors I found:
1. A line that starts with 1 or more characters followed by a |, then
only digits to EOL is a word definition.
2. A line that starts with either ( or - is a synonym definition.
This may not be a valid assumption as I've seen lines that start with
interj that were definitely synonym definitions. I am not sure what
interj means in th_ro_RO_v2.dat, so I have special-cased interj and
prep to also count as synonym lines, but the script still complains
about them.
I'm not sure if the inconsistency in naming is related to l10n issues or
just an inconsistency so I've left it on nag for now.
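The two assumptions above can be sketched roughly as follows. This is
an illustrative Python sketch, not the attached Perl script; the names
WORD_DEF, SUSPECT_PREFIXES and classify are hypothetical, and the regex
is only one reading of "1 or more characters, a |, then only digits to
EOL".

```python
import re

# Assumption 1: "<one or more characters>|<digits only>" is a word definition.
WORD_DEF = re.compile(r"^(.+)\|(\d+)$")
# Special-cased prefixes: treated as synonym lines, but still nagged about.
SUSPECT_PREFIXES = ("interj", "prep")


def classify(line):
    """Return (kind, nag): kind is 'word', 'synonym' or 'other'."""
    line = line.rstrip("\r\n")
    if WORD_DEF.match(line):
        return "word", False
    # Assumption 2: a synonym line starts with "(" or "-".
    if line.startswith(("(", "-")):
        return "synonym", False
    if line.startswith(SUSPECT_PREFIXES):
        return "synonym", True
    return "other", False
```

Under this sketch, classify("interj|ah|oh") is counted as a synonym
line with the nag flag set, while a header line such as "UTF-8" falls
through as 'other'.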
With these assumptions the script compares the expected number of
synonyms with the actual number of synonyms and complains if they
don't match (with word and line numbers displayed for the definition).
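A minimal Python sketch of that count comparison, assuming the line
formats described above. It is not the attached Perl script, and
count_mismatches and the other names are hypothetical.

```python
import re

# "<word>|<expected synonym count>" introduces an entry (assumed format).
WORD_DEF = re.compile(r"^(.+)\|(\d+)$")
# Prefixes counted as synonym lines, including the special-cased ones.
SYNONYM_PREFIXES = ("(", "-", "interj", "prep")


def count_mismatches(lines):
    """Return (line_number, word, expected, actual) for each mismatch."""
    mismatches = []
    current = None  # [line_number, word, expected, actual]
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        match = WORD_DEF.match(line)
        if match:
            # A new word closes the previous entry; check it first.
            if current and current[2] != current[3]:
                mismatches.append(tuple(current))
            current = [number, match.group(1), int(match.group(2)), 0]
        elif current and line.startswith(SYNONYM_PREFIXES):
            current[3] += 1
    # Check the final entry in the file as well.
    if current and current[2] != current[3]:
        mismatches.append(tuple(current))
    return mismatches
```

Each returned tuple carries the word and the line number of its
definition, which is enough to print the kind of complaint described
above.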
It will also complain if it finds the same word more than once and
will print out both lines on which the suspect word was found.
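The duplicate check could look something like this in Python. Again,
this is only a sketch under the same format assumptions, not the
attached Perl script; find_duplicates is a hypothetical name.

```python
import re

# "<word>|<digits>" is assumed to be a word definition line.
WORD_DEF = re.compile(r"^(.+)\|(\d+)$")


def find_duplicates(lines):
    """Return (word, first_line, repeat_line) for each repeated word."""
    first_seen = {}  # word -> line number of its first definition
    duplicates = []
    for number, raw in enumerate(lines, start=1):
        match = WORD_DEF.match(raw.rstrip("\r\n"))
        if not match:
            continue
        word = match.group(1)
        if word in first_seen:
            duplicates.append((word, first_seen[word], number))
        else:
            first_seen[word] = number
    return duplicates
```

Keeping the first line number in a dictionary is what lets both lines
be reported when the suspect word shows up again.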
The script finds no issues in a number of dictionaries, but outputs
this many informational lines for the following dictionaries in my
LibreOffice build tree:
I hope this helps. The perl script is MPL 1.1 / GPLv3+ / LGPLv3+ as
per the header.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 3183 bytes
Desc: not available