[HarfBuzz] hb_ot_tag_to_language

James Clark jjc at jclark.com
Sun Mar 9 19:04:06 PDT 2014


A few more general comments.

1. The structure HarfBuzz is using to represent the mapping from IETF
language tags to OpenType language system tags works fine for language tags
as defined by RFC 3066, but now that RFC 3066 has been obsoleted by RFC
4646 and now RFC 5646, it is no longer sufficient.  As before, Chinese is
the most complicated case, and I have described the problems as they apply
to Chinese here

https://github.com/jclark/lang-ietf-opentype/blob/master/doc/chinese.md

but there are also non-Chinese tags where subtags need to be considered
(notably el-polyton, ga-Latg, ka-Geok, ar-Syrc).

One way to deal with this would be to have the tables look like this

typedef struct {
  char subtag[8];
  hb_tag_t ot_tag;
} LangSubtag;

typedef struct {
  hb_tag_t initial_tag;
  hb_tag_t ot_tag;
  const LangSubtag *subtags;
} LangInitialTag;

static const LangSubtag zh_lang[] = {
  // ...
  {"hk", HB_TAG('Z','H','H',' ')},
  {"hant", HB_TAG('Z','H','T',' ')},
  // ...
  {"", 0} // mark end of list
};

static const LangInitialTag ot_languages[] = {
  // ...
  // Third field can be omitted for most tags.
  {HB_TAG('e','n',' ',' '), HB_TAG('E','N','G',' ') },
  // ...
  {HB_TAG('y','u','e',' '), HB_TAG('Z','H','S',' '), zh_lang},
  // ...
  {HB_TAG('z','h',' ',' '), HB_TAG('Z','H','S',' '), zh_lang},
  // ...
};
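
To make the intent concrete, here is a rough sketch of how a lookup over
tables of this shape might work.  It is only an illustration under some
assumptions: tag_t and TAG are stand-ins for hb_tag_t and HB_TAG so that it
compiles on its own, the helper names (primary_tag, has_subtag,
lookup_ot_tag) are made up, the subtag field is widened to 9 chars to leave
room for a terminating NUL, and I am assuming that the order of entries in a
subtag list gives their priority.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for hb_tag_t / HB_TAG (assumption: same 4-byte packing). */
typedef uint32_t tag_t;
#define TAG(a,b,c,d) ((tag_t)((((uint32_t)(a) & 0xFF) << 24) | \
                              (((uint32_t)(b) & 0xFF) << 16) | \
                              (((uint32_t)(c) & 0xFF) << 8)  | \
                               ((uint32_t)(d) & 0xFF)))

typedef struct {
  char subtag[9];            /* lowercase subtag; "" ends the list */
  tag_t ot_tag;
} LangSubtag;

typedef struct {
  tag_t initial_tag;         /* primary language subtag, space padded */
  tag_t ot_tag;              /* default OT tag for this language */
  const LangSubtag *subtags; /* optional overrides, may be NULL */
} LangInitialTag;

static const LangSubtag zh_lang[] = {
  {"hk",   TAG('Z','H','H',' ')},
  {"hant", TAG('Z','H','T',' ')},
  {"", 0}
};

static const LangInitialTag ot_languages[] = {
  {TAG('e','n',' ',' '), TAG('E','N','G',' '), NULL},
  {TAG('z','h',' ',' '), TAG('Z','H','S',' '), zh_lang},
};

/* Pack the primary subtag (text before the first '-') into a tag. */
static tag_t primary_tag(const char *lang)
{
  char buf[4] = {' ', ' ', ' ', ' '};
  size_t i;
  for (i = 0; lang[i] && lang[i] != '-'; i++) {
    if (i == 4) return 0;    /* too long to fit in a tag: no match */
    buf[i] = (char)tolower((unsigned char)lang[i]);
  }
  return TAG(buf[0], buf[1], buf[2], buf[3]);
}

/* Does lang contain key (lowercase) as a subtag after the primary one? */
static int has_subtag(const char *lang, const char *key)
{
  size_t klen = strlen(key);
  const char *p = strchr(lang, '-');
  while (p) {
    const char *start = p + 1;
    const char *end = strchr(start, '-');
    size_t n = end ? (size_t)(end - start) : strlen(start);
    if (n == klen) {
      size_t i = 0;
      while (i < n && tolower((unsigned char)start[i]) == key[i]) i++;
      if (i == n) return 1;
    }
    p = end;
  }
  return 0;
}

/* Returns the OT tag for lang, or 0 if the language is not in the table. */
static tag_t lookup_ot_tag(const char *lang)
{
  tag_t primary;
  size_t i;
  assert(lang != NULL);
  primary = primary_tag(lang);
  for (i = 0; i < sizeof ot_languages / sizeof ot_languages[0]; i++) {
    const LangInitialTag *e = &ot_languages[i];
    const LangSubtag *s;
    if (e->initial_tag != primary) continue;
    /* The first listed subtag present in lang wins (assumed priority). */
    for (s = e->subtags; s && s->subtag[0]; s++)
      if (has_subtag(lang, s->subtag)) return s->ot_tag;
    return e->ot_tag;
  }
  return 0;
}
```

With these two table entries, lookup_ot_tag("zh-Hant-HK") would pick ZHH
because "hk" is listed before "hant", and plain "zh" falls back to ZHS.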

This still leaves the "fonipa" (IPA) variant tag to be handled in code,
which is not ideal, but I haven't found a good way to deal with this
declaratively.
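
For what it is worth, the check in code could be fairly small.  The sketch
below is only one possible way to do it: has_fonipa_variant is a made-up
name, and the caller would map a positive result to the IPPH tag from the
commented-out entry in the source.  It relies on the RFC 5646 rule that
variants come before any extension or private-use subtags, so it stops
scanning at the first single-character subtag.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if lang carries the "fonipa" variant subtag, else 0.
 * Hypothetical helper, not part of HarfBuzz. */
static int has_fonipa_variant(const char *lang)
{
  static const char key[] = "fonipa";
  const char *p;
  assert(lang != NULL);
  p = strchr(lang, '-');               /* skip the primary language subtag */
  while (p) {
    const char *start = p + 1;
    const char *end = strchr(start, '-');
    size_t n = end ? (size_t)(end - start) : strlen(start);
    size_t i;
    if (n == 1) return 0;              /* singleton: no variants after this */
    if (n == sizeof key - 1) {
      for (i = 0; i < n && tolower((unsigned char)start[i]) == key[i]; i++) {}
      if (i == n) return 1;
    }
    p = end;
  }
  return 0;
}
```

This treats "und-fonipa" and "en-Latn-US-fonipa" the same way, and ignores a
"fonipa" that only appears inside a private-use sequence such as
"en-x-fonipa".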

2. Since all the inputs (the ISO 639-2/3/5 registries, the IETF language
registry and the OpenType spec)  used for generating the language mapping
table change from time to time, I think it would improve maintainability to
generate the table completely automatically, with the various tweaks that
are needed being included in the generating program, rather than applying
the tweaks manually to the program output.  I have taken this approach here:

https://github.com/jclark/lang-ietf-opentype/blob/master/gen/gen.js

When you have had a chance to consider my proposed changes in the previous
email in detail, I would be happy to add an option to make the output
correspond to the changes that you decide to accept (I am not expecting you
to agree with all my proposed changes -- there is scope for reasonable
people to disagree), and even generate the output in a format suitable to
be #include'd in hb-ot-tag.cc.

3. In some cases, there are multiple OT langsys tags to which an IETF
language tag could be mapped. Sometimes this is because the OT tag
definition is not clear, sometimes it's because OT tags represent variants
for which there are no IETF language tags, and sometimes it's because an OT
tag represents an individual language that is part of a language
group/macrolanguage represented by another OT tag. This makes me wonder
whether it would be better/more robust to map an IETF language tag to an
ordered list of OT langsys tags, and then HB would use the first langsys
tag that the font supports. However, I am not sure it is worth the
effort/complication, since most of these cases are pretty obscure, and it's
easy and efficient for fonts to make multiple langsys tags behave the same.
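
If it were done, the selection step itself would be simple.  As a sketch
only: choose_langsys is a made-up name, tag_t and TAG are stand-ins for
hb_tag_t and HB_TAG, and I am assuming the font's langsys tags for the
script have already been collected into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for hb_tag_t / HB_TAG (assumption: same 4-byte packing). */
typedef uint32_t tag_t;
#define TAG(a,b,c,d) ((tag_t)((((uint32_t)(a) & 0xFF) << 24) | \
                              (((uint32_t)(b) & 0xFF) << 16) | \
                              (((uint32_t)(c) & 0xFF) << 8)  | \
                               ((uint32_t)(d) & 0xFF)))

/* Returns the first candidate (in preference order) that the font
 * provides, or 0 if none is present and the default langsys applies. */
static tag_t choose_langsys(const tag_t *candidates, size_t n_candidates,
                            const tag_t *font_tags, size_t n_font_tags)
{
  size_t i, j;
  assert(candidates != NULL || n_candidates == 0);
  assert(font_tags != NULL || n_font_tags == 0);
  for (i = 0; i < n_candidates; i++)
    for (j = 0; j < n_font_tags; j++)
      if (candidates[i] == font_tags[j])
        return candidates[i];
  return 0;
}
```

As a hypothetical example, a language tag mapped to the list ZHH, ZHT would
get ZHT from a font that only has ZHS and ZHT langsys entries.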

James


On Sun, Mar 9, 2014 at 12:58 AM, Behdad Esfahbod <behdad at behdad.org> wrote:

> Wow.  Thanks James!  Will study in detail.  Roozbeh: we should do that
> together, perhaps.
>
> On 14-03-08 03:17 AM, James Clark wrote:
> > For my own project, I needed to implement mapping from IETF language tags to
> > OpenType language system tags.  I ended up writing some code to generate the
> > mapping and then comparing the results with HarfBuzz.  For each case where
> > there was a discrepancy, I did enough research to convince myself of the right
> > result.  The HB source refers to a recent Microsoft draft, from which some
> > entries have been added; I skipped these entries (which I assume are similar
> > to the ones in the ISO 3rd ed WD 5, which I found here
> > http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition
> > ).
> >
> > I documented the research here
> >
> > https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md
> >
> > As a result I have a lot of comments about HarfBuzz's implementation.
> >
> > First some stuff that is just typos.
> >
> > "ber" should be mapped to BBR not BER.
> >
> > There's a duplicate entry for "hz" not in sort order.
> >
> > The entries for "sck", "vls", "wo" are not in sort order.
> >
> > The tag for "tmh" is in lower case instead of upper case.
> >
> > Some tags are missing a final zero. The ISO WD adds some 4-character tags,
> > whose last character is a zero.  There are five cases where these have been
> > added, but the final zero was incorrectly omitted: kab -> KAB0, ksh -> KSH0,
> > kg -> KON0, pap -> PAP0, sn -> SNA0.
> >
> > The following entries appear in the spec, but are missing from HarfBuzz, and
> > they seem uncontroversial to me.
> >
> > wlc CMR Mwali Comorian
> > wni CMR Ndzwani Comorian
> > zdj CMR Ngazidja Comorian
> > caf CRR Southern Carrier
> > co COS Corsican
> >
> > The last is probably missing because it was omitted from the ISO WD; I suspect
> > this is a bug in the ISO WD.
> >
> > HarfBuzz (and the OT spec) are inconsistent in their handling of
> > macrolanguages.  Sometimes when an IETF macrolanguage is mapped to an OT lang,
> > they also map the individual languages encompassed by the macrolanguage to
> > that OT tag and sometimes they don't.  I would suggest that the consistent and
> > reasonable policy is always to map the individual languages to the same OT tag
> > as the macrolanguage, unless the individual language is separately mapped to a
> > more specific OT tag. I created a file with the additional entries that would
> > be needed to implement this policy in HarfBuzz:
> >
> > https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt
> >
> > The rest of my comments are not self-evident.  You will need to refer to the
> > notes I linked to above for my reasoning.
> >
> > My first set of removal/additions is in accordance with the ISO 639 codes in
> > the spec. I suggest removing these mappings:
> >
> > eot BTI Beti (Côte d'Ivoire)
> > kvd KUI Kui (Indonesia)
> > mdc MLE Male (Papua New Guinea)
> > mlq MNK Western Maninkakan
> > nco SIB Sibe
> > ril RIA Riang (India)
> > xom KMO Komo (Sudan)
> > yso NIS Nisi (China)
> >
> > and adding these:
> >
> > sjo SIB Xibe
> > pro PRO Old Provencal
> > rmz ARK Marma
> >
> > The next set is not in the spec.  Remove:
> >
> > xst SIG (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)
> >
> > and add:
> >
> > njz NIS Nyishi
> > tgj NIS Tagin
> > beb BTI Bebele
> > bum BTI Bulu (Cameroon)
> > bxp BTI Bebil
> > eto BTI Eton (Cameroon)
> > ewo BTI Ewondo
> > fan BTI Fang (Equatorial Guinea)
> > mct BTI Mengisa
> >
> > Finally, I have suggestions for the commented-out entries in the source:
> >
> > /*{"ahg/awn/xan?",HB_TAG('A','G','W',' ')},*//* Agaw */
> >
> > "ahg", "awn"
> >
> > /*{"gsw?/gsw-FR?",HB_TAG('A','L','S',' ')},*//* Alsatian */
> >
> > "gsw"
> >
> > /*{"krc",HB_TAG('B','A','L',' ')},*//* Balkar */
> >
> > Leave unmapped
> >
> > /*{"??",HB_TAG('B','C','R',' ')},*//* Bible Cree */
> >
> > Leave unmapped
> >
> > /*{"zh?",HB_TAG('C','H','N',' ')},*//* Chinese (seen in Microsoft fonts) */
> >
> > ???
> >
> > /*{"acf/gcf?",HB_TAG('F','A','N',' ')},*//* French Antillean */
> >
> > "acf", "gcf"
> >
> > /*{"enf?/yrk?",HB_TAG('F','N','E',' ')},*//* Forest Nenets */
> >
> > Leave unmapped
> >
> > /*{"fuf?",HB_TAG('F','T','A',' ')},*//* Futa */
> >
> > "fuf"
> >
> > /*{"ar-Syrc?",HB_TAG('G','A','R',' ')},*//* Garshuni */
> >
> > "ar-Syrc"
> >
> > /*{"cfm/rnl?",HB_TAG('H','A','L',' ')},*//* Halam */
> >
> > "cfm"
> >
> > /*{"fonipa",HB_TAG('I','P','P','H')},*//* Phonetic transcription—IPA
> > conventions */
> >
> > "und-fonipa", or better map anything with a variant of "fonipa"
> >
> > /*{"ga-Latg?/Latg?",HB_TAG('I','R','T',' ')},*//* Irish Traditional */
> >
> > "ga-Latg"
> >
> > /*{"krc",HB_TAG('K','A','R',' ')},*//* Karachay */
> >
> > "krc"
> >
> > /*{"alw?/ktb?",HB_TAG('K','E','B',' ')},*//* Kebena */
> >
> > "alw"
> >
> > /*{"Geok",HB_TAG('K','G','E',' ')},*//* Khutsuri Georgian */
> >
> > "ka-Geok" (Georgian written with the Khutsuri script)
> >
> > /*{"kca",HB_TAG('K','H','K',' ')},*//* Khanty-Kazim */
> >
> > "kca"
> >
> > /*{"kca",HB_TAG('K','H','S',' ')},*//* Khanty-Shurishkar */
> >
> > Leave unmapped
> >
> > /*{"kca",HB_TAG('K','H','V',' ')},*//* Khanty-Vakhi */
> >
> > Leave unmapped
> >
> > /*{"guz?/kqs?/kss?",HB_TAG('K','I','S',' ')},*//* Kisii */
> >
> > "guz"
> >
> > /*{"kfa/kfi?/kpb?/xua?/xuj?",HB_TAG('K','O','D',' ')},*//* Kodagu */
> >
> > "kfa"
> >
> > /*{"okm?/oko?",HB_TAG('K','O','H',' ')},*//* Korean Old Hangul */
> >
> > "okm"
> >
> > /*{"kon?/ktu?/...",HB_TAG('K','O','N',' ')},*//* Kikongo */
> >
> > "ktu"
> >
> > /*{"kfx?",HB_TAG('K','U','L',' ')},*//* Kulvi */
> >
> > "kfx"
> >
> > /*{"??",HB_TAG('L','A','H',' ')},*//* Lahuli */
> >
> > "lbf", "lae", "bfu"
> >
> > /*{"??",HB_TAG('L','C','R',' ')},*//* L-Cree */
> >
> > Leave unmapped
> >
> > /*{"??",HB_TAG('M','A','L',' ')},*//* Malayalam Traditional */
> >
> > Leave unmapped
> >
> > /*{"mnk?/mlq?/...",HB_TAG('M','L','N',' ')},*//* Malinke */
> >
> > "mlq"
> >
> > /*{"??",HB_TAG('N','C','R',' ')},*//* N-Cree */
> >
> > "csw"
> >
> > /*{"??",HB_TAG('N','H','C',' ')},*//* Norway House Cree */
> >
> > Leave unmapped
> >
> > /*{"jpa?/sam?",HB_TAG('P','A','A',' ')},*//* Palestinian Aramaic */
> >
> > "jpa", "sam"
> >
> > /*{"polyton",HB_TAG('P','G','R',' ')},*//* Polytonic Greek */
> >
> > "el-polyton"
> >
> > /*{"??",HB_TAG('Q','I','N',' ')},*//* Asho Chin */
> >
> > "tbq"
> >
> > (The spec says Chin not Asho Chin.)
> >
> > /*{"??",HB_TAG('R','C','R',' ')},*//* R-Cree */
> >
> > "atj"
> >
> > /*{"chp?",HB_TAG('S','A','Y',' ')},*//* Sayisi */
> >
> > Leave unmapped
> >
> > /*{"xan?",HB_TAG('S','E','K',' ')},*//* Sekota */
> >
> > "xan"
> >
> > /*{"ngo?",HB_TAG('S','X','T',' ')},*//* Sutu */
> >
> > Leave unmapped
> >
> > /*{"??",HB_TAG('T','C','R',' ')},*//* TH-Cree */
> >
> > Leave unmapped
> >
> > /*{"tnz?/tog?/toi?",HB_TAG('T','N','G',' ')},*//* Tonga */
> >
> > "toi"
> >
> > /*{"enh?/yrk?",HB_TAG('T','N','E',' ')},*//* Tundra Nenets */
> >
> > "yrk"
> >
> > /*{"??",HB_TAG('W','C','R',' ')},*//* West-Cree */
> >
> > Leave unmapped
> >
> > /*{"cre?",HB_TAG('Y','C','R',' ')},*//* Y-Cree */
> >
> > "crk"
> >
> > /*{"??",HB_TAG('Y','I','C',' ')},*//* Yi Classic */
> >
> > Leave unmapped
> >
> > /*{"ii?/Yiii?",HB_TAG('Y','I','M',' ')},*//* Yi Modern */
> >
> > "ii"
> >
> > It would also be desirable to map otherwise unmapped languages in the
> > Yi script (i.e. with a script code of Yiii) to YIM.
> >
> > /*{"??",HB_TAG('Z','H','P',' ')},*//* Chinese Phonetic */
> >
> > "zh-Latn"
> >
> > I'll have some more general comments later.
> >
> > James
> >
> >
> >
> >
>
> --
> behdad
> http://behdad.org/
>