Tagging text as being in arbitrary complex-script languages

Eike Rathke erack at redhat.com
Thu Apr 18 10:25:11 UTC 2019


On Wednesday, 2019-04-17 22:11:58 +0100, Richard Wordingham wrote:

> Is there a pointer as to which tag sequences that "strictly follow the
> BCP 47 language tag specification" are "correct"?

"strictly" here means, do not invent stuff, specifically not anything
that is not covered by the syntax defined, e.g. es-ES_tradnl is not
a valid language tag; do not invent language codes, tags or subtags, do
not use unassigned language codes. Be aware that an "x-..." private use
tag indeed *does* mean private and thus should not be stored in
documents that reach the wild.
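
To illustrate the purely syntactic side of "strictly", here is a minimal
sketch in Python. It assumes a simplified subset of the BCP 47 (RFC 5646)
grammar; the names LANGTAG and is_well_formed are made up for this example
and are not LibreOffice code. It checks only the shape of a tag, not whether
the subtags are actually registered.

```python
import re

# Simplified subset of the BCP 47 (RFC 5646) "langtag" grammar.
# Assumption: extlang subtags, extension singletons and grandfathered
# tags are deliberately left out to keep the sketch short.
LANGTAG = re.compile(
    r"""
    (?:
        (?P<language>[A-Za-z]{2,3})
        (?:-(?P<script>[A-Za-z]{4}))?
        (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
        (?P<privateuse>-x(?:-[A-Za-z0-9]{1,8})+)?
      |
        (?P<private_only>x(?:-[A-Za-z0-9]{1,8})+)
    )
    """,
    re.VERBOSE,
)


def is_well_formed(tag):
    """Return True if tag matches the simplified grammar above.

    This is a syntax check only; it does not consult the IANA
    Language Subtag Registry, so unassigned codes still pass.
    """
    return LANGTAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ("sa-IN", "sa-150", "sa-Deva-150", "sa-Latn", "es-ES_tradnl"):
        print(tag, is_well_formed(tag))
```

The underscore in es-ES_tradnl is what makes it fail: subtags are separated
by hyphens only. A tag that passes this check may still be neither valid
nor "correct" in the sense discussed here.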

"correct" here also means, furthermore than being strict it should make
sense. E.g. assigning a ...-Latn tag to the CTL category does not make
sense, and a language-script combination that does not exist doesn't
make sense either.

> As far as I can tell, the following all strictly follow the
> specification:


> "sa-IN" Sanskrit as used in India - so far as I can tell, that could be
> in, for example, Devanagari, Grantha or even the Tamil script!  For
> Devanagari at least, I understand that this implies that homorganic
> nasals may be written using U+0902 DEVANAGARI SIGN ANUSVARA.

If in doubt, ask Microsoft, provided the LCID assigned in isolang.cxx
isn't a LANGUAGE_USER_... one; here it is LANGUAGE_SANSKRIT 0x044F. Most of
these even predate the existence of BCP 47, when only combinations of language
code and country code were used (also due to the Java Locale

What I usually did is look up the language at SIL and the Ethnologue and
use the most prevalent script as the implied default script. Here,
https://www.ethnologue.com/language/san would lead to Devanagari, but in
this case what MS assigned the LCID for is more important.

> "sa-150" Sanskrit written using European conventions - so, any script,
> but, at least for Devanagari, the anusvara sign is not used for
> homorganic nasals.

Though such a tag is valid, LibreOffice doesn't use numeric UN M.49
codes; it may be accepted but might not work everywhere.
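
As a minimal sketch of how one might spot such tags before storing them,
the following Python helper reports whether a tag's region subtag is
a two-letter country code or a three-digit UN M.49 code. The function name
region_kind is hypothetical, the input is assumed to be an already
well-formed tag without extlang subtags, and this is not how LibreOffice
itself handles it.

```python
def region_kind(tag):
    """Classify the region subtag of a well-formed BCP 47 tag.

    Returns "alpha2" for a two-letter country code, "m49" for
    a three-digit UN M.49 code, or None if there is no region subtag.
    Assumption: no extlang subtags, and the tag is well-formed.
    """
    subtags = tag.split("-")
    if subtags[0].lower() == "x":
        # Pure private-use tag, nothing to classify.
        return None
    rest = subtags[1:]
    # Skip an optional four-letter script subtag such as "Deva".
    if rest and len(rest[0]) == 4 and rest[0].isascii() and rest[0].isalpha():
        rest = rest[1:]
    if not rest:
        return None
    candidate = rest[0]
    if not candidate.isascii():
        return None
    if len(candidate) == 2 and candidate.isalpha():
        return "alpha2"
    if len(candidate) == 3 and candidate.isdigit():
        return "m49"
    return None


if __name__ == "__main__":
    for tag in ("sa-IN", "sa-150", "sa-Deva-150", "sa-Latn"):
        print(tag, region_kind(tag))
```

With this, sa-150 and sa-Deva-150 are both reported as "m49", so a caller
could warn about them or fall back to a country-code tag.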

> "sa-Deva-150" Sanskrit written in Devanagari in the manner used in
> Europe.

Same here.

> "sa-Latn" Sanskrit written in the Roman script.
> "sa-Latf" Sanskrit written in Fraktur (I'm not sure that this exists.
> It might need a hint as to where to find a Fraktur script with a
> combining candrabindu.)

Both perfectly valid, if they serve any purpose. Though with sa-Latn
I doubt there's a use case, so I wouldn't call that "correct" in common

I also just learned that sa-Latf somehow exists..


GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918  630B 6A6C D5B7 6563 2D3A