[HarfBuzz] hb_ot_tag_to_language

Behdad Esfahbod behdad at behdad.org
Sat Mar 8 09:58:56 PST 2014


Wow.  Thanks James!  Will study in detail.  Roozbeh: we should that together
perhaps.

On 14-03-08 03:17 AM, James Clark wrote:
> For my own project, I needed to implement mapping from IETF language tags to
> OpenType language system tags.  I ended up writing some code to generate the
> mapping and then comparing the results with HarfBuzz.  For each case where
> there was a discrepancy, I did enough research to convince myself of the right
> result.  The HB source refers to a recent Microsoft draft, from which some
> entries have been added; I skipped these entries (which I assume are similar
> to the ones in the ISO 3rd ed WD 5, which I found
> here http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition).
> 
> I documented the research here 
> 
> https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md
> 
> As a result I have a lot of comments about HarfBuzz's implementation.
> 
> First some stuff that is just typos.
> 
> "ber" should be mapped to BBR not BER.
> 
> There's a duplicate entry for "hz" not in sort order.
> 
> The entries for "sck", "vls", "wo" are not in sort order.
> 
> The tag for "tmh" is in lower case instead of upper case.
> 
> Some tags are missing a final zero. The ISO WD adds some 4-character tags,
> whose last character is a zero.  There are four cases where these have been
> added, but the final zero was incorrectly omitted: kab -> KAB0, ksh -> KSH0,
> kg -> KON0, pap -> PAP0, sn -> SNA0.
> 
> The following entries appear in the spec, but are missing from HarfBuzz, and
> they seem uncontroversial to me.
> 
> wlc CMR Mwali Comorian
> wni CMR Ndzwani Comorian
> zdj CMR Ngazidja Comorian
> caf CRR Southern Carrier
> co COS Corsican
> 
> The last is probably missing because it was omitted from the ISO WD; I suspect
> this is a bug in the ISO WD.
> 
> HarfBuzz (and the OT spec) are inconsistent in their handling of
> macrolanguages.  Sometimes when an IETF macrolanguage is mapped to an OT lang,
> they also map the individual languages encompassed by the macrolanguage to
> that OT tag and sometimes they don't.  I would suggest that the consistent and
> reasonable policy is always to map the individual languages to the same OT tag
> as the macrolanguage, unless the individual language is separately mapped to a
> more specific OT tag. I created a file with the additional entries that would
> be needed to implement this policy in HarfBuzz:
> 
> https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt
> 
> The rest of my comments are not self-evident.  You will need to refer to the
> notes I linked to above for my reasoning.
> 
> My first set of removal/additions is in accordance with the ISO 639 codes in
> the spec. I suggest removing these mappings:
> 
> eot BTI Beti (Côte d'Ivoire)
> kvd KUI Kui (Indonesia)
> mdc MLE Male (Papua New Guinea)
> mlq MNK Western Maninkakan
> nco SIB Sibe
> ril RIA Riang (India)
> xom KMO Komo (Sudan)
> yso NIS Nisi (China)
> 
> and adding these:
> 
> sjo SIB Xibe
> pro PRO Old Provencal
> rmz ARK Marma
> 
> The next set is not in the spec.  Remove:
> 
> xst SIG (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)
> 
> and add:
> 
> njz NIS Nyishi
> tgj NIS Tagin
> beb BTI Bebele
> bum BTI Bulu (Cameroon)
> bxp BTI Bebil
> eto BTI Eton (Cameroon)
> ewo BTI Ewondo
> fan BTI Fang (Equatorial Guinea)
> mct BTI Mengisa
> 
> Finally I have suggestions the commented out entries in the source:
> 
> /*{"ahg/awn/xan?",HB_TAG('A','G','W',' ')},*//* Agaw */
> 
> "ahg", "awn"
> 
> /*{"gsw?/gsw-FR?",HB_TAG('A','L','S',' ')},*//* Alsatian */
> 
> "gsw"
> 
> /*{"krc",HB_TAG('B','A','L',' ')},*//* Balkar */
> 
> Leave unmapped
> 
> /*{"??",HB_TAG('B','C','R',' ')},*//* Bible Cree */
> 
> Leave unmapped
> 
> /*{"zh?",HB_TAG('C','H','N',' ')},*//* Chinese (seen in Microsoft fonts) */
> 
> ???
> 
> /*{"acf/gcf?",HB_TAG('F','A','N',' ')},*//* French Antillean */
> 
> "acf", "gcf"
> 
> /*{"enf?/yrk?",HB_TAG('F','N','E',' ')},*//* Forest Nenets */
> 
> Leave unmapped
> 
> /*{"fuf?",HB_TAG('F','T','A',' ')},*//* Futa */
> 
> "fuf"
> 
> /*{"ar-Syrc?",HB_TAG('G','A','R',' ')},*//* Garshuni */
> 
> "ar-Syrc"
> 
> /*{"cfm/rnl?",HB_TAG('H','A','L',' ')},*//* Halam */
> 
> "cfm"
> 
> /*{"fonipa",HB_TAG('I','P','P','H')},*//* Phonetic transcription—IPA
> conventions */
> 
> "und-fonipa", or better map anything with a variant of "fonipa"
> 
> /*{"ga-Latg?/Latg?",HB_TAG('I','R','T',' ')},*//* Irish Traditional */
> 
> "ga-Latg"
> 
> /*{"krc",HB_TAG('K','A','R',' ')},*//* Karachay */
> 
> "krc"
> 
> /*{"alw?/ktb?",HB_TAG('K','E','B',' ')},*//* Kebena */
> 
> "alw"
> 
> /*{"Geok",HB_TAG('K','G','E',' ')},*//* Khutsuri Georgian */
> 
> "ka-Geok" (Georgian written with the Khutsuri script)
> 
> /*{"kca",HB_TAG('K','H','K',' ')},*//* Khanty-Kazim */
> 
> "kca"
> 
> /*{"kca",HB_TAG('K','H','S',' ')},*//* Khanty-Shurishkar */
> 
> Leave unmapped
> 
> /*{"kca",HB_TAG('K','H','V',' ')},*//* Khanty-Vakhi */
> 
> Leave unmapped
> 
> /*{"guz?/kqs?/kss?",HB_TAG('K','I','S',' ')},*//* Kisii */
> 
> "guz"
> 
> /*{"kfa/kfi?/kpb?/xua?/xuj?",HB_TAG('K','O','D',' ')},*//* Kodagu */
> 
> "kfa"
> 
> /*{"okm?/oko?",HB_TAG('K','O','H',' ')},*//* Korean Old Hangul */
> 
> "okm"
> 
> /*{"kon?/ktu?/...",HB_TAG('K','O','N',' ')},*//* Kikongo */
> 
> "ktu"
> 
> /*{"kfx?",HB_TAG('K','U','L',' ')},*//* Kulvi */
> 
> "kfx"
> 
> /*{"??",HB_TAG('L','A','H',' ')},*//* Lahuli */
> 
> "lbf", "lae", "bfu"
> 
> /*{"??",HB_TAG('L','C','R',' ')},*//* L-Cree */
> 
> Leave unmapped
> 
> /*{"??",HB_TAG('M','A','L',' ')},*//* Malayalam Traditional */
> 
> Leave unmapped
> 
> /*{"mnk?/mlq?/...",HB_TAG('M','L','N',' ')},*//* Malinke */
> 
> "mlq"
> 
> /*{"??",HB_TAG('N','C','R',' ')},*//* N-Cree */
> 
> "csw"
> 
> /*{"??",HB_TAG('N','H','C',' ')},*//* Norway House Cree */
> 
> Leave unmapped
> 
> /*{"jpa?/sam?",HB_TAG('P','A','A',' ')},*//* Palestinian Aramaic */
> 
> "jpa", "sam"
> 
> /*{"polyton",HB_TAG('P','G','R',' ')},*//* Polytonic Greek */
> 
> "el-polyton"
> 
> /*{"??",HB_TAG('Q','I','N',' ')},*//* Asho Chin */
> 
> "tbq"
> 
> (The spec says Chin not Asho Chin.)
> 
> /*{"??",HB_TAG('R','C','R',' ')},*//* R-Cree */
> 
> "atj"
> 
> /*{"chp?",HB_TAG('S','A','Y',' ')},*//* Sayisi */
> 
> Leave unmapped
> 
> /*{"xan?",HB_TAG('S','E','K',' ')},*//* Sekota */
> 
> "xan"
> 
> /*{"ngo?",HB_TAG('S','X','T',' ')},*//* Sutu */
> 
> Leave unmapped
> 
> /*{"??",HB_TAG('T','C','R',' ')},*//* TH-Cree */
> 
> Leave unmapped
> 
> /*{"tnz?/tog?/toi?",HB_TAG('T','N','G',' ')},*//* Tonga */
> 
> "toi"
> 
> /*{"enh?/yrk?",HB_TAG('T','N','E',' ')},*//* Tundra Nenets */
> 
> "yrk"
> 
> /*{"??",HB_TAG('W','C','R',' ')},*//* West-Cree */
> 
> Leave unmapped
> 
> /*{"cre?",HB_TAG('Y','C','R',' ')},*//* Y-Cree */
> 
> "crk"
> 
> /*{"??",HB_TAG('Y','I','C',' ')},*//* Yi Classic */
> 
> Leave unmapped
> 
> /*{"ii?/Yiii?",HB_TAG('Y','I','M',' ')},*//* Yi Modern */
> 
> "ii"
> 
> It would also be desirable to map otherwise unmapped languages in the
> Yi script (ie with with a script code of Yiii) to YIM.
> 
> /*{"??",HB_TAG('Z','H','P',' ')},*//* Chinese Phonetic */
> 
> "zh-Latn"
> 
> I'll have some more general comments later.
> 
> James
> 
> 
> 
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> 

-- 
behdad
http://behdad.org/


More information about the HarfBuzz mailing list