Tagging text as being in arbitrary complex-script languages

Richard Wordingham richard.wordingham at ntlworld.com
Tue Apr 16 20:04:42 UTC 2019


On Mon, 15 Apr 2019 15:14:49 +0000
jonathon <toki.kantoor at gmail.com> wrote:

> On 4/15/19 12:26 PM, Eike Rathke wrote:

> > Adding arbitrary dictionary languages (as long as they strictly
> > follow the BCP 47 language tag specification) has worked for quite
> > a while already (since 2014?).

Only if you hacked the text to declare the CTL or CJK language, as
appropriate, to be that of the dictionary. Otherwise, you could only
use such a dictionary for a 'Western' script.

As recently as 2015, another issue was that I had to regenerate
hunspell/utf_info.cxx for a LibreOffice build so that it would accept
word characters as word characters.  I don't know how well that file
tracks the Unicode standard nowadays.  When can we expect Pali
spell-checking in the extended Lao script (Pali support to 1930s
standards was only added this year) to have problems only because of
the inadequacy of the dictionaries?

> > New(er) in the mentioned mechanism is the
> > ability to add a language also to the CTL or CJK sections where
> > previously it was only possible to add to the (misnamed) "Western"
> > section, and give the language list entries a proper UI name
> > instead of showing just the language tag.

> Thanks.
> I wasn't aware that that functionality was present.

> I'll play with it over the next month or so, then write about it in
> my long-neglected blog.

An interesting experiment would be to try adding a language to both
Western and CTL (as with Mongolian and some minor SEA languages) or
Western and CJK (various Zhuang writing systems), though I suppose it
won't hurt to simply disambiguate by script. In general, tagging has the
potential to get very messy, e.g. Pali in Lanna script as used in
Northern Thailand as opposed to Pali in Lanna script as used in
North-eastern Thailand. (Yes, there are systematic spelling differences
between the two.)
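
To make "disambiguate by script" concrete, here is a minimal sketch in
Python. It assumes a simplified subset of BCP 47 (language, optional
script, optional region, optional private-use part) and an illustrative
table mapping ISO 15924 script codes to the Western/CTL/CJK sections.
The table, the function names and the private-use subtags x-north and
x-neast are all hypothetical; none of this is LibreOffice's actual code
or data.

```python
import re

# Illustrative mapping from ISO 15924 script codes to the three UI
# sections. This table is an assumption made for the example, not
# LibreOffice's real classification data.
SCRIPT_GROUP = {
    "Latn": "Western",
    "Cyrl": "Western",
    "Mong": "CTL",
    "Lana": "CTL",  # Tai Tham (Lanna)
    "Laoo": "CTL",
    "Thai": "CTL",
    "Hani": "CJK",
}

# Simplified subset of BCP 47: language[-Script][-REGION][-x-private...].
# Extended language subtags, variants and extensions are not handled.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?:-x-(?P<private>[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*))?$"
)


def parse_tag(tag):
    """Split a tag into its subtags, normalising case by convention."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not in the supported subset of BCP 47: {tag!r}")
    parts = match.groupdict()
    parts["language"] = parts["language"].lower()
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["region"]:
        parts["region"] = parts["region"].upper()
    return parts


def script_group(tag):
    """Return the UI section for a tag, decided by its script subtag alone."""
    script = parse_tag(tag)["script"]
    if script is None:
        raise ValueError(f"tag has no script subtag: {tag!r}")
    return SCRIPT_GROUP.get(script, "unknown")


if __name__ == "__main__":
    for tag in ("mn-Cyrl", "mn-Mong", "za-Latn", "za-Hani",
                "pi-Lana-TH-x-north", "pi-Lana-TH-x-neast"):
        print(tag, script_group(tag), parse_tag(tag)["private"])
```

With an explicit script subtag, Mongolian and Zhuang each land in
exactly one section per tag. The two regional Lanna spellings of Pali,
however, share language, script and region, so in this sketch they can
only be told apart by the private-use part, which is where the
messiness starts.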

Richard.

