Request for [API CHANGE] in spell checking: add new options to disable rule-based compounding

Németh László nemeth at numbertext.org
Wed Jan 4 13:18:40 UTC 2023


Hi,

I've started to add two new spell checking options to
css.linguistic2.XLinguProperties (screenshot:
https://wiki.documentfoundation.org/images/a/a2/Spelling_options_compound.png),
which can improve spell checking a lot. Because API changes need more
attention, please check the Rationale below and comment on the extension,
the captions of the check boxes, and the patch itself, especially if
backwards compatibility is accidentally broken (I'm not aware of any
breakage). If it's OK with you, my plan is to extend the help (and to follow
up on the other Hunspell problems, e.g. overly redundant suggestions in
several cases).

From the commit description:

“For professional proofreaders, it can be more important to avoid the
mistakes of the rule-based compound word recognition, than
to speed up proofreading. Disabling the following two new options
will report all rule-based closed compound words (default in
Dutch, German, Hungarian etc. dictionaries) and rule-based
hyphenated compound words (all languages with BREAK usage in
their Hunspell dictionaries):

- "Accept possible closed compound words"

- "Accept possible hyphenated compound words"

For example, disabling the second one, dictionary word "scot-free"
will be still correct word in English spell checking, but not
the previously accepted compound "arbitrary-word-with-hyphen".”
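
To make the difference between the two settings concrete, here is a minimal
Python sketch of the hyphenated case. It is a toy model, not Hunspell's
algorithm and not the LibreOffice API: the word list, the function name
is_correct and the parameter accept_hyphenated_compounds are all
hypothetical, and splitting at hyphens only loosely imitates what BREAK does.

```python
# Toy model only: NOT Hunspell's algorithm and NOT the LibreOffice API.
# The word list and all names below are hypothetical.
DICTIONARY = {"scot-free", "arbitrary", "word", "with", "hyphen"}


def is_correct(word, accept_hyphenated_compounds=True):
    """Return True if the toy checker accepts `word`."""
    # A word listed in the dictionary is always accepted,
    # even if it contains a hyphen (e.g. "scot-free").
    if word in DICTIONARY:
        return True
    # Rule-based step: accept a hyphenated word if every part
    # between the hyphens is itself a dictionary word.
    if accept_hyphenated_compounds and "-" in word:
        return all(part in DICTIONARY for part in word.split("-"))
    return False
```

With the option disabled, is_correct("scot-free",
accept_hyphenated_compounds=False) still holds because the word is listed,
while "arbitrary-word-with-hyphen" is accepted only with the option enabled.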

Commit:
https://git.libreoffice.org/core/+/57d79744c77eef96b4c2bd3b16e0a04317ffcf9e%5E%21

Rationale:

The spell checkers of MS Office and Google Docs have started to rely on
"common knowledge" by collecting words and user feedback from the internet.
This is cheap and up to date, and likely good enough for writing private
messages, but it is not suitable for professional document editing (see for
example the user feedback on Word, “new version of the spell checker is
awful”:
https://answers.microsoft.com/en-us/msoffice/forum/all/spell-check-problems/10078dbf-855a-4154-afb4-fac5e5c24ad8).

Several languages, such as Dutch, French, German and Hungarian, use an
academic approach, i.e. an orthography standardized by government or
national bodies; see for example the official status of Duden in Germany.
A spell checker that accepts spelling mistakes because users make them
frequently is the opposite of a spell checker, at least in a document
editor. Thanks to the lazy approach of the other document editors, the
spell checker of Writer can be more attractive to professionals than before.

Hunspell and the Hunspell dictionaries are not perfect either. Editors have
long requested an option to disable rule-based compound words, because
while the rule-based approach successfully eliminated false alarms (note:
German-like orthography generates millions of “single-use” correct word
forms, which cannot all be listed in a spelling dictionary), it also made
spell checking malfunction: typos and missing spaces between words are
frequently skipped by the spell checker. Hunspell already had a successful
solution to limit this in the most important cases: if the possible
rule-based compound word is also a dictionary word with a serious spelling
mistake, the word form is reported as a spelling mistake (see REP and
CHECKCOMPOUNDREP in
https://github.com/hunspell/hunspell/blob/master/man/hunspell.5). The new
Hunspell 1.7.2 added a similar feature for rule-based compound words
composed of 3 or more words (
https://github.com/hunspell/hunspell/commit/ff3591b0f76950f13d73123d03a03edd9a892945).

But this is not enough: other typos are still accepted as compound words by
rule-based compounding. The new options are not entirely new in the case of
Hungarian: the Lightproof spell checker already contains the options
“Underline all typo-like compound words” and “Underline all generated
compound words”. This feature is important enough to be available for all
languages with the same potential problem. An editor who wants more
realistic, i.e. strict dictionary-based, spell checking can disable these
new options and, with some effort, fix the typos and the missing spaces
without reading 300 pages of a book (or in addition to reading it: reading
the book does not guarantee that you will spot the typos).
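
The trade-off for closed compounds can be sketched in a few lines of Python.
This is again a toy model under stated assumptions, not how Hunspell
implements compounding: the word list, the two-part splitting rule, the
minimum part length of 3 and the names is_closed_compound, misspelled and
accept_closed_compounds are all hypothetical.

```python
# Toy model only: NOT Hunspell's compounding rules. All names, the
# word list and the minimum part length are hypothetical.
DICTIONARY = {"book", "shelf", "spell", "checker", "the", "read"}


def is_closed_compound(word, min_part=3):
    """True if `word` splits into two dictionary words of >= min_part letters."""
    return any(
        word[:i] in DICTIONARY and word[i:] in DICTIONARY
        for i in range(min_part, len(word) - min_part + 1)
    )


def misspelled(words, accept_closed_compounds=True):
    """Return the words the toy checker would underline, in input order."""
    reported = []
    for word in words:
        if word in DICTIONARY:
            continue  # listed words are always accepted
        if accept_closed_compounds and is_closed_compound(word):
            continue  # rule-based acceptance, may hide a missing space
        reported.append(word)
    return reported


text = ["the", "spellchecker", "readthe", "book"]
print(misspelled(text))                                 # option enabled
print(misspelled(text, accept_closed_compounds=False))  # option disabled
```

With the option enabled nothing is reported, so the missing space in
"readthe" slips through. With it disabled both "spellchecker" and "readthe"
are reported: the editor has to dismiss the first one by hand, but gets to
fix the second.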

Best regards,
László