[HarfBuzz] hb_ot_tag_to_language

James Clark jjc at jclark.com
Sat Mar 8 03:17:24 PST 2014

For my own project, I needed to implement mapping from IETF language tags
to OpenType language system tags.  I ended up writing some code to generate
the mapping and then comparing the results with HarfBuzz.  For each case
where there was a discrepancy, I did enough research to convince myself of
the right result.  The HB source refers to a recent Microsoft draft, from
which some entries have been added; I skipped these entries (which I assume
are similar to the ones in the ISO 3rd ed WD 5, which I found here

I documented the research here


As a result I have a lot of comments about HarfBuzz's implementation.

First some stuff that is just typos.

"ber" should be mapped to BBR not BER.

There's a duplicate entry for "hz" not in sort order.

The entries for "sck", "vls", "wo" are not in sort order.
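Ordering and duplicate problems like these can be caught mechanically. The following is a minimal sketch, assuming the table is meant to be strictly ascending in byte-wise strcmp order on the language string. The lang_entry struct and the function name are hypothetical stand-ins, not HarfBuzz's actual types.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one row of the language table. */
typedef struct {
  const char *language;  /* IETF language subtag, e.g. "hz" */
  const char *ot_tag;    /* OpenType language system tag as a string */
} lang_entry;

/* Returns the index of the first entry that sorts before, or repeats,
 * the entry preceding it; returns -1 if the table is strictly ascending.
 * Assumes byte-wise strcmp order, which may not be the order the real
 * table relies on. */
static long first_unsorted_entry(const lang_entry *table, size_t count)
{
  for (size_t i = 1; i < count; i++)
    if (strcmp(table[i - 1].language, table[i].language) >= 0)
      return (long)i;
  return -1;
}
```

Running a check of this kind over the whole table at build or test time would flag both a duplicate key and a misplaced key at the index of the offending entry.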

The tag for "tmh" is in lower case instead of upper case.

Some tags are missing a final zero. The ISO WD adds some 4-character tags,
whose last character is a zero.  There are five cases where these have been
added, but the final zero was incorrectly omitted: kab -> KAB0, ksh ->
KSH0, kg -> KON0, pap -> PAP0, sn -> SNA0.
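To illustrate why the omitted zero matters: a three-character tag is padded with a space, so 'KAB ' and 'KAB0' are different 32-bit values and a font that uses one will not match a lookup for the other. This is a minimal sketch. OT_TAG and ot_tag_from_string are local, hypothetical names that follow the usual big-endian packing of four bytes, not HarfBuzz's own definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs four bytes, first byte most significant, into a 32-bit tag.
 * Local stand-in macro for illustration. */
#define OT_TAG(a, b, c, d)                                      \
  ((uint32_t)(((uint32_t)(uint8_t)(a) << 24) |                  \
              ((uint32_t)(uint8_t)(b) << 16) |                  \
              ((uint32_t)(uint8_t)(c) << 8) |                   \
              (uint32_t)(uint8_t)(d)))

/* Builds a tag from a string of up to four characters, padding with
 * spaces on the right. Characters beyond the fourth are ignored. */
static uint32_t ot_tag_from_string(const char *s)
{
  char buf[4] = {' ', ' ', ' ', ' '};
  size_t n = strlen(s);
  if (n > 4)
    n = 4;
  memcpy(buf, s, n);
  return OT_TAG(buf[0], buf[1], buf[2], buf[3]);
}
```

With this packing, ot_tag_from_string("KAB") yields the space-padded tag, which compares unequal to OT_TAG('K','A','B','0').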

The following entries appear in the spec, but are missing from HarfBuzz,
and they seem uncontroversial to me.

wlc CMR Mwali Comorian
wni CMR Ndzwani Comorian
zdj CMR Ngazidja Comorian
caf CRR Southern Carrier
co COS Corsican

The last is probably missing because it was omitted from the ISO WD; I
suspect this is a bug in the ISO WD.

HarfBuzz (and the OT spec) are inconsistent in their handling of
macrolanguages.  Sometimes when an IETF macrolanguage is mapped to an OT
lang, they also map the individual languages encompassed by the
macrolanguage to that OT tag and sometimes they don't.  I would suggest
that the consistent and reasonable policy is always to map the individual
languages to the same OT tag as the macrolanguage, unless the individual
language is separately mapped to a more specific OT tag. I created a file
with the additional entries that would be needed to implement this policy
in HarfBuzz:

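The macrolanguage policy suggested above can also be expressed as a lookup rule instead of extra table entries. The following is a minimal sketch under stated assumptions: the two small tables are illustrative excerpts only (Malay "ms" as a macrolanguage that encompasses "id", "zlm" and "zsm"), and every name in it is hypothetical. A language with its own mapping wins; otherwise the macrolanguage's mapping is used.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *key;
  const char *value;
} pair;

/* Illustrative direct mappings (IETF language -> OT tag). */
static const pair direct_map[] = {
  {"id", "IND "},  /* Indonesian: separately mapped, more specific */
  {"ms", "MLY "},  /* Malay (macrolanguage) */
};

/* Illustrative excerpt of individual language -> macrolanguage. */
static const pair macrolanguage_of[] = {
  {"id", "ms"},
  {"zlm", "ms"},
  {"zsm", "ms"},
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Linear search; returns the value for key, or NULL if absent. */
static const char *find(const pair *map, size_t n, const char *key)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp(map[i].key, key) == 0)
      return map[i].value;
  return NULL;
}

/* Returns the OT tag for a language, or NULL if unmapped. A direct
 * mapping takes precedence over the macrolanguage fallback. */
static const char *ot_tag_for_language(const char *lang)
{
  const char *tag = find(direct_map, COUNT(direct_map), lang);
  if (tag)
    return tag;
  const char *macro = find(macrolanguage_of, COUNT(macrolanguage_of), lang);
  return macro ? find(direct_map, COUNT(direct_map), macro) : NULL;
}
```

Here "zsm" resolves to the same tag as "ms", while "id" keeps its own tag. A real implementation would take the macrolanguage relation from the IANA language subtag registry rather than a hand-written excerpt.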

The rest of my comments are not self-evident.  You will need to refer to
the notes I linked to above for my reasoning.

My first set of removals/additions is in accordance with the ISO 639 codes
in the spec. I suggest removing these mappings:

eot BTI Beti (Côte d'Ivoire)
kvd KUI Kui (Indonesia)
mdc MLE Male (Papua New Guinea)
mlq MNK Western Maninkakan
nco SIB Sibe
ril RIA Riang (India)
xom KMO Komo (Sudan)
yso NIS Nisi (China)

and adding these:

sjo SIB Xibe
pro PRO Old Provencal
rmz ARK Marma

The next set is not in the spec.  Remove:

xst SIG (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)

and add:

njz NIS Nyishi
tgj NIS Tagin
beb BTI Bebele
bum BTI Bulu (Cameroon)
bxp BTI Bebil
eto BTI Eton (Cameroon)
ewo BTI Ewondo
fan BTI Fang (Equatorial Guinea)
mct BTI Mengisa

Finally I have suggestions for the commented-out entries in the source:

/*{"ahg/awn/xan?", HB_TAG('A','G','W',' ')},*/ /* Agaw */

"ahg", "awn"

/*{"gsw?/gsw-FR?", HB_TAG('A','L','S',' ')},*/ /* Alsatian */


/*{"krc", HB_TAG('B','A','L',' ')},*/ /* Balkar */

Leave unmapped

/*{"??", HB_TAG('B','C','R',' ')},*/ /* Bible Cree */

Leave unmapped

/*{"zh?", HB_TAG('C','H','N',' ')},*/ /* Chinese (seen in Microsoft fonts) */


/*{"acf/gcf?", HB_TAG('F','A','N',' ')},*/ /* French Antillean */

"acf", "gcf"

/*{"enf?/yrk?", HB_TAG('F','N','E',' ')},*/ /* Forest Nenets */

Leave unmapped

/*{"fuf?", HB_TAG('F','T','A',' ')},*/ /* Futa */


/*{"ar-Syrc?", HB_TAG('G','A','R',' ')},*/ /* Garshuni */


/*{"cfm/rnl?", HB_TAG('H','A','L',' ')},*/ /* Halam */


/*{"fonipa", HB_TAG('I','P','P','H')},*/ /* Phonetic transcription—IPA conventions */

"und-fonipa", or better map anything with a variant of "fonipa"
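Matching "anything with a variant of fonipa" means looking at the subtags after the primary language subtag rather than comparing the whole tag. The following is a minimal sketch, assuming subtags are separated by '-' and compared case-insensitively. The helper names are hypothetical, and it deliberately ignores finer points of tag structure such as extension singletons.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the n bytes at a with the whole of b. */
static bool subtag_equals(const char *a, size_t n, const char *b)
{
  if (strlen(b) != n)
    return false;
  for (size_t i = 0; i < n; i++)
    if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
      return false;
  return true;
}

/* Returns true if any subtag after the primary language subtag equals
 * `wanted`. Scanning stops at a private-use singleton ("x"), since
 * anything after it is not a registered variant. */
static bool has_subtag(const char *tag, const char *wanted)
{
  const char *p = strchr(tag, '-');
  while (p) {
    const char *start = p + 1;
    const char *end = strchr(start, '-');
    size_t len = end ? (size_t)(end - start) : strlen(start);
    if (subtag_equals(start, len, "x"))
      return false;
    if (subtag_equals(start, len, wanted))
      return true;
    p = end;
  }
  return false;
}
```

A caller could then map any tag for which has_subtag(tag, "fonipa") is true to IPPH before falling back to the ordinary language lookup.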

/*{"ga-Latg?/Latg?", HB_TAG('I','R','T',' ')},*/ /* Irish Traditional */


/*{"krc", HB_TAG('K','A','R',' ')},*/ /* Karachay */


/*{"alw?/ktb?", HB_TAG('K','E','B',' ')},*/ /* Kebena */


/*{"Geok", HB_TAG('K','G','E',' ')},*/ /* Khutsuri Georgian */

"ka-Geok" (Georgian written with the Khutsuri script)

/*{"kca", HB_TAG('K','H','K',' ')},*/ /* Khanty-Kazim */


/*{"kca", HB_TAG('K','H','S',' ')},*/ /* Khanty-Shurishkar */

Leave unmapped

/*{"kca", HB_TAG('K','H','V',' ')},*/ /* Khanty-Vakhi */

Leave unmapped

/*{"guz?/kqs?/kss?", HB_TAG('K','I','S',' ')},*/ /* Kisii */


/*{"kfa/kfi?/kpb?/xua?/xuj?", HB_TAG('K','O','D',' ')},*/ /* Kodagu */


/*{"okm?/oko?", HB_TAG('K','O','H',' ')},*/ /* Korean Old Hangul */


/*{"kon?/ktu?/...", HB_TAG('K','O','N',' ')},*/ /* Kikongo */


/*{"kfx?", HB_TAG('K','U','L',' ')},*/ /* Kulvi */


/*{"??", HB_TAG('L','A','H',' ')},*/ /* Lahuli */

"lbf", "lae", "bfu"

/*{"??", HB_TAG('L','C','R',' ')},*/ /* L-Cree */

Leave unmapped

/*{"??", HB_TAG('M','A','L',' ')},*/ /* Malayalam Traditional */

Leave unmapped

/*{"mnk?/mlq?/...", HB_TAG('M','L','N',' ')},*/ /* Malinke */


/*{"??", HB_TAG('N','C','R',' ')},*/ /* N-Cree */


/*{"??", HB_TAG('N','H','C',' ')},*/ /* Norway House Cree */

Leave unmapped

/*{"jpa?/sam?", HB_TAG('P','A','A',' ')},*/ /* Palestinian Aramaic */

"jpa", "sam"

/*{"polyton", HB_TAG('P','G','R',' ')},*/ /* Polytonic Greek */


/*{"??", HB_TAG('Q','I','N',' ')},*/ /* Asho Chin */


(The spec says Chin not Asho Chin.)

/*{"??", HB_TAG('R','C','R',' ')},*/ /* R-Cree */


/*{"chp?", HB_TAG('S','A','Y',' ')},*/ /* Sayisi */

Leave unmapped

/*{"xan?", HB_TAG('S','E','K',' ')},*/ /* Sekota */


/*{"ngo?", HB_TAG('S','X','T',' ')},*/ /* Sutu */

Leave unmapped

/*{"??", HB_TAG('T','C','R',' ')},*/ /* TH-Cree */

Leave unmapped

/*{"tnz?/tog?/toi?", HB_TAG('T','N','G',' ')},*/ /* Tonga */


/*{"enh?/yrk?", HB_TAG('T','N','E',' ')},*/ /* Tundra Nenets */


/*{"??", HB_TAG('W','C','R',' ')},*/ /* West-Cree */

Leave unmapped

/*{"cre?", HB_TAG('Y','C','R',' ')},*/ /* Y-Cree */


/*{"??", HB_TAG('Y','I','C',' ')},*/ /* Yi Classic */

Leave unmapped

/*{"ii?/Yiii?", HB_TAG('Y','I','M',' ')},*/ /* Yi Modern */


It would also be desirable to map otherwise unmapped languages in the
Yi script (i.e. with a script code of Yiii) to YIM.
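A script-based fallback of this kind needs the script subtag, which is the four-letter alphabetic subtag that follows the language and any three-letter extended language subtags. The following is a minimal sketch under that assumption. has_script is a hypothetical helper, and it expects `script` to be exactly four letters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the first n bytes of s are all ASCII letters. */
static bool all_alpha(const char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (!isalpha((unsigned char)s[i]))
      return false;
  return true;
}

/* Returns true if the tag carries the given four-letter script subtag,
 * compared case-insensitively. Three-letter alphabetic subtags directly
 * after the language are treated as extended language subtags and
 * skipped; any other subtag ends the search. */
static bool has_script(const char *tag, const char *script)
{
  const char *p = strchr(tag, '-');
  while (p) {
    const char *start = p + 1;
    const char *end = strchr(start, '-');
    size_t len = end ? (size_t)(end - start) : strlen(start);
    if (len == 3 && all_alpha(start, len)) {
      p = end;  /* extended language subtag: keep looking */
      continue;
    }
    if (len != 4 || !all_alpha(start, len))
      return false;  /* region, variant, etc.: no script present */
    for (size_t i = 0; i < 4; i++)
      if (tolower((unsigned char)start[i]) != tolower((unsigned char)script[i]))
        return false;
    return true;
  }
  return false;
}
```

With this helper, a lookup that finds no language mapping could check has_script(tag, "Yiii") and return YIM in that case.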

/*{"??", HB_TAG('Z','H','P',' ')},*/ /* Chinese Phonetic */


I'll have some more general comments later.

