Tagging text as being in arbitrary complex-script languages

Eike Rathke erack at redhat.com
Thu Apr 18 10:25:11 UTC 2019


Hi,

On Wednesday, 2019-04-17 22:11:58 +0100, Richard Wordingham wrote:

> Is there a pointer as to which tag sequences that "strictly follow the
> BCP 47 language tag specification" are "correct"?

"strictly" here means: do not invent stuff, and in particular nothing
that is not covered by the defined syntax, e.g. es-ES_tradnl is not
a valid language tag. Do not invent language codes, tags or subtags, and
do not use unassigned language codes. Be aware that an "x-..." private-use
tag indeed *does* mean private and thus should not be stored in
documents that reach the wild.
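
To illustrate the syntax point, here is a minimal sketch of a well-formedness
check. It is not LibreOffice code; the name is_well_formed and the simplified
pattern are assumptions made for this example. It covers only
language[-script][-region][-variant...] plus private use, and it does not
consult the IANA subtag registry, so it cannot tell whether a subtag is
actually assigned.

```python
import re

# Simplified BCP 47 shape: language[-script][-region][-variant...][-x-private]
# or a pure private-use tag. Assumption: extlang, extensions and
# grandfathered tags are deliberately left out of this sketch.
LANGTAG = re.compile(
    r"""(?:
        (?P<language>[A-Za-z]{2,3})
        (?:-(?P<script>[A-Za-z]{4}))?
        (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
        (?:-x(?:-[A-Za-z0-9]{1,8})+)?
      |
        x(?:-[A-Za-z0-9]{1,8})+
    )""",
    re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Check syntax only; says nothing about whether subtags are registered."""
    return LANGTAG.fullmatch(tag) is not None


for tag in ("sa-IN", "sa-150", "sa-Deva-150", "sa-Latn", "es-ES_tradnl"):
    print(tag, is_well_formed(tag))
```

The underscore makes es-ES_tradnl fail, while the sa-... tags pass. Passing
this check only means the syntax is followed, not that the tag makes sense.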

"correct" here means that, beyond being strict, the tag should also make
sense. E.g. assigning a ...-Latn tag to the CTL category does not make
sense, and neither does a language-script combination that does not
exist.

> As far as I can tell, the following all strictly follow the
> specification:

Yes.

> "sa-IN" Sanskrit as used in India - so far as I can tell, that could be
> in, for example, Devanagari, Grantha or even the Tamil script!  For
> Devanagari at least, I understand that this implies that homorganic
> nasals may be written using U+0902 DEVANAGARI SIGN ANUSVARA.

If in doubt, and if the LCID assigned in isolang.cxx isn't
a LANGUAGE_USER_... one, ask Microsoft; here it is LANGUAGE_SANSKRIT
0x044F. Most of these even predate the existence of BCP 47, when only
combinations of language code and country code were used (also due to
the Java Locale restrictions).

What I usually did is look up the language at SIL and the Ethnologue and
use the most prevalent script as the implied default script. Here,
https://www.ethnologue.com/language/san would lead to Devanagari, but in
this case what MS assigned the LCID for is more important.

> "sa-150" Sanskrit written using European conventions - so, any script,
> but, at least for Devanagari, the anusvara sign is not used for
> homorganic nasals.

Though it is valid, LibreOffice doesn't use the numeric UN M.49 codes;
such a tag may be accepted but might not work everywhere.
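
As a rough sketch of how one could spot such tags before relying on them,
the following helper reports whether a simple tag carries a two-letter
region code or a numeric UN M.49 one. The function region_kind is
hypothetical and not part of LibreOffice; it assumes the tag is already
well-formed and has no extlang subtag.

```python
def region_kind(tag: str) -> str:
    """Return 'alpha2', 'un_m49' or 'none' for the region subtag of a tag.

    Assumption: the tag is well-formed and of the simple form
    language[-script][-region]..., so a region can only be the
    second or third subtag.
    """
    subtags = tag.split("-")
    for sub in subtags[1:3]:
        if len(sub) == 1:
            # A singleton such as "x" starts private use or an extension.
            break
        if len(sub) == 2 and sub.isascii() and sub.isalpha():
            return "alpha2"
        if len(sub) == 3 and sub.isascii() and sub.isdigit():
            return "un_m49"
    return "none"


for tag in ("sa-IN", "sa-150", "sa-Deva-150", "sa-Latn"):
    print(tag, region_kind(tag))
```

A caller could then warn about tags for which this returns 'un_m49'.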

> "sa-Deva-150" Sanskrit written in Devanagari in the manner used in
> Europe.

Same here.

> "sa-Latn" Sanskrit written in the Roman script.
> 
> "sa-Latf" Sanskrit written in Fraktur (I'm not sure that this exists.
> It might need a hint as to where to find a Fraktur script with a
> combining candrabindu.)

Both are perfectly valid, if they serve any purpose. Though with sa-Latn
I doubt there's a use case, so I wouldn't call that "correct" in the
common sense.

I also just learned that sa-Latf somehow exists.

  Eike

-- 
GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918  630B 6A6C D5B7 6563 2D3A