Tagging text as being in arbitrary complex-script languages

Eike Rathke erack at redhat.com
Wed Apr 17 11:53:25 UTC 2019

Hi Richard,

On Tuesday, 2019-04-16 21:04:42 +0100, Richard Wordingham wrote:

> On Mon, 15 Apr 2019 15:14:49 +0000
> jonathon <toki.kantoor at gmail.com> wrote:
> > On 4/15/19 12:26 PM, Eike Rathke wrote:
> > > Adding arbitrary dictionary languages (as long as they strictly
> > > follow the BCP 47 language tag specification) has worked for quite
> > > a while (since 2014?) already.
> Only if you hacked the text to declare the CTL or CJK language as
> appropriate to be the one of the dictionary. Otherwise, you could only
> use such a dictionary for a 'Western' script.

Well, that's what I wrote, and also that specifying the internal script
type category (Western/CJK/CTL) was added later.

> > > New(er) in the mentioned mechanism is the
> > > ability to add a language also to the CTL or CJK sections where
> > > previously it was only possible to add to the (misnamed) "Western"
> > > section, and give the language list entries a proper UI name
> > > instead of showing just the language tag.
> > Thanks.
> > I wasn't aware that that functionality was present.
> > I'll play with it over the next month or so, then write about it in
> > my long-neglected blog.
> An interesting experiment would be to try adding a language to both
> Western and CTL (as with Mongolian and some minor SEA languages) or
> Western and CJK (various Zhuang writing systems), though I suppose it
> won't hurt to simply disambiguate by script.

In fact you have to, or use an ISO 639-1/2/3 language code that implies
a default script for one and specify an ISO 15924 script code for the
other, which is what I was referring to with "correct BCP 47 language
tags".

Mongolian is slightly more complicated because historically it uses the
'mn' macrolanguage code (which probably should have been 'khk'), with
'mn-Cyrl' for Mongolian in Cyrillic script (so instead of 'mn-Cyrl-MN'
it could have been 'khk-MN'). For Mongolian in Mongolian script there is
'mn-Mong', for example in 'mn-Mong-CN'.
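
As a rough illustration of disambiguating by script (this is not the
actual isolang.cxx logic; the table, the function names and the
simplified tag pattern are assumptions made up for this sketch), a tag
can be split into its subtags and the script subtag looked up in a
script type table:

```python
import re

# Hypothetical table from ISO 15924 script codes to the script type
# category. Only a few illustrative entries, not LibreOffice's real data.
SCRIPT_CATEGORY = {
    "Latn": "Western",
    "Cyrl": "Western",
    "Mong": "CTL",
    "Hans": "CJK",
}

# Simplified pattern: language[-Script][-REGION]. Full BCP 47 also allows
# variants, extensions and private-use subtags, which are ignored here.
TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Return (language, script, region); missing subtags are None."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a simple language[-Script][-REGION] tag: {tag!r}")
    script = m.group("script")
    region = m.group("region")
    return (
        m.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


def script_category(tag):
    """Return the category for an explicit script subtag, else None.

    None means the tag carries no (known) script subtag, so the script
    would have to be implied by the language code's default script.
    """
    _, script, _ = split_tag(tag)
    return SCRIPT_CATEGORY.get(script) if script else None
```

With this, 'mn-Mong-CN' and 'mn-Cyrl-MN' end up in different categories
purely through their script subtags, while a tag like 'khk-MN' has no
script subtag and relies on the language's implied default.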

See the tables in i18nlangtag/source/isolang/isolang.cxx for our known

Note also that the language tag attributes used are saved with the
document, so once introduced they will have to be supported for ~ever.
Just changing them later without having at least a forward mapping (in
said isolang.cxx) to load existing documents that use them is a no-no.
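
To make the forward mapping idea concrete, here is a minimal sketch. It
is not how isolang.cxx implements it; FORWARD_MAP, resolve_saved_tag()
and the rename of 'mn-Cyrl-MN' to 'khk-MN' are purely hypothetical and
only serve to show what happens when a document carries an old tag:

```python
# Hypothetical forward mapping: tag found in an old document -> tag in
# use now. The entry below is a made-up rename, for illustration only.
FORWARD_MAP = {
    "mn-Cyrl-MN": "khk-MN",
}


def resolve_saved_tag(tag, known_tags):
    """Map a tag read from a document to a currently supported tag."""
    current = FORWARD_MAP.get(tag, tag)
    if current not in known_tags:
        # Without a mapping, an old document's tag would simply be lost.
        raise LookupError(f"no supported tag for {tag!r}")
    return current


# Tags the (hypothetical) current version knows about.
known = {"khk-MN", "mn-Mong-CN"}
print(resolve_saved_tag("mn-Cyrl-MN", known))
print(resolve_saved_tag("mn-Mong-CN", known))
```

Dropping the FORWARD_MAP entry makes the first lookup fail, which is
the situation an existing document would run into after a careless
rename.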


