<div dir="ltr">A few more general comments.<div><br></div><div>1. The structure HarfBuzz is using to represent the mapping from IETF language tags to OpenType language system tags works fine for language tags as defined by RFC 3066, but now that RFC 3066 has been obsoleted by RFC 4646 and then RFC 5646, it is no longer sufficient. As before, Chinese is the most complicated case, and I have described the problems as they apply to Chinese here:</div> <div><br></div><div><a href="https://github.com/jclark/lang-ietf-opentype/blob/master/doc/chinese.md">https://github.com/jclark/lang-ietf-opentype/blob/master/doc/chinese.md</a><br></div><div><br></div><div>but there are also non-Chinese tags where subtags need to be considered (notably el-polyton, ga-Latg, ka-Geok, ar-Syrc).</div> <div><br></div><div>One way to deal with this would be to have the tables look like this:</div><div><br></div><div>typedef struct {</div><div> char subtag[8];</div><div> hb_tag_t ot_tag;</div><div>} LangSubtag;</div><div> <br></div><div>typedef struct {</div><div> hb_tag_t initial_tag;</div><div> hb_tag_t ot_tag;</div><div> const LangSubtag *subtags;</div><div>} LangInitialTag;</div><div><br></div><div>static const LangSubtag zh_lang[] = {</div><div> // ...</div><div><div> {"hk", HB_TAG('Z','H','H',' ')},</div></div><div> {"hant", HB_TAG('Z','H','T',' ')},</div><div> // ...<br></div> <div> {""} // empty subtag marks end of list</div><div>};</div><div><br></div><div>static const LangInitialTag ot_languages[] = {</div><div> // ...</div><div> // Third field can be omitted for most tags.</div><div> {HB_TAG('e','n',' ',' '), HB_TAG('E','N','G',' ') },</div> <div> // ...</div><div><div> {HB_TAG('y','u','e',' '), HB_TAG('Z','H','S',' '), zh_lang},</div></div><div> // ...</div><div> {HB_TAG('z','h',' ',' '), HB_TAG('Z','H','S',' '), zh_lang},</div> <div> // ...</div><div>};</div><div><br></div><div>This still leaves the "fonipa" (IPA) variant tag to be handled in code, which is not ideal, but I haven't found a good way to deal with
this declaratively.</div> <div><br></div><div>2. Since all the inputs (the ISO 639-2/3/5 registries, the IETF language registry and the OpenType spec) used for generating the language mapping table change from time to time, I think it would improve maintainability to generate the table completely automatically, with the various tweaks that are needed being included in the generating program, rather than applying the tweaks manually to the program output. I have taken this approach here:</div> <div><br></div><div><a href="https://github.com/jclark/lang-ietf-opentype/blob/master/gen/gen.js">https://github.com/jclark/lang-ietf-opentype/blob/master/gen/gen.js</a><br></div><div><br></div><div>When you have had a chance to consider my proposed changes in the previous email in detail, I would be happy to add an option to make the output correspond to the changes that you decide to accept (I am not expecting you to agree with all my proposed changes -- there is scope for reasonable people to disagree), and even generate the output in a format suitable to be #include'd in hb-ot-tag.cc.</div> <div><br></div><div>3. In some cases, there are multiple OT langsys tags to which an IETF language tag could be mapped. Sometimes this is because the OT tag definition is not clear, sometimes it's because OT tags represent variants for which there is no IETF language tag, and sometimes it's because an OT tag represents an individual language that is part of a language group/macrolanguage represented by another OT tag. This makes me wonder whether it would be better/more robust to map an IETF language tag to an ordered list of OT langsys tags, and have HB use the first langsys tag that the font supports. 
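<br><br>As a rough illustration of that idea, the selection step could look something like the sketch below. It is only a sketch under stated assumptions: ot_tag_t, OT_TAG and choose_langsys are hypothetical stand-ins rather than HarfBuzz API, the font's supported tags are passed in as a plain array, and the zh-HK candidate list is just an example built from the ZHH/ZHT entries in point 1.<br><br>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for hb_tag_t / HB_TAG, so the sketch is self-contained. */
typedef uint32_t ot_tag_t;
#define OT_TAG(a, b, c, d)                                   \
  ((ot_tag_t)(((uint32_t)(unsigned char)(a) << 24) |         \
              ((uint32_t)(unsigned char)(b) << 16) |         \
              ((uint32_t)(unsigned char)(c) << 8) |          \
              ((uint32_t)(unsigned char)(d))))
#define OT_TAG_NONE ((ot_tag_t)0)

/* Example only: an ordered candidate list for zh-HK, most specific first. */
static const ot_tag_t zh_hk_candidates[] = {
  OT_TAG('Z','H','H',' '),
  OT_TAG('Z','H','T',' '),
};

/* Example fonts, described only by the langsys tags they support. */
static const ot_tag_t font_a_tags[] = { OT_TAG('Z','H','S',' '), OT_TAG('Z','H','T',' ') };
static const ot_tag_t font_b_tags[] = { OT_TAG('Z','H','T',' '), OT_TAG('Z','H','H',' ') };
static const ot_tag_t font_c_tags[] = { OT_TAG('E','N','G',' ') };

/* Return the first candidate that the font supports, or OT_TAG_NONE.
   The order of the candidates decides, not the order in the font. */
static ot_tag_t
choose_langsys(const ot_tag_t *candidates, size_t n_candidates,
               const ot_tag_t *font_tags, size_t n_font_tags)
{
  for (size_t i = 0; i < n_candidates; i++)
    for (size_t j = 0; j < n_font_tags; j++)
      if (candidates[i] == font_tags[j])
        return candidates[i];
  return OT_TAG_NONE;
}
```

<br>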
However, I am not sure it is worth the effort/complication, since most of these cases are pretty obscure, and it's easy and efficient for fonts to make multiple langsys tags behave the same.</div> <div><br></div><div>James</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Mar 9, 2014 at 12:58 AM, Behdad Esfahbod <span dir="ltr">&lt;<a href="mailto:behdad@behdad.org" target="_blank">behdad@behdad.org</a>&gt;</span> wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Wow. Thanks James! Will study in detail. Roozbeh: we should do that together<br> perhaps.<br> <div><div class="h5"><br> On 14-03-08 03:17 AM, James Clark wrote:<br> > For my own project, I needed to implement mapping from IETF language tags to<br> > OpenType language system tags. I ended up writing some code to generate the<br> > mapping and then comparing the results with HarfBuzz. For each case where<br> > there was a discrepancy, I did enough research to convince myself of the right<br> > result. 
The HB source refers to a recent Microsoft draft, from which some<br> > entries have been added; I skipped these entries (which I assume are similar<br> > to the ones in the ISO 3rd ed WD 5, which I found<br> > here <a href="http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition" target="_blank">http://mpeg.chiariglione.org/standards/mpeg-4/open-font-format/text-wd-isoiec-14496-22-3rd-edition</a>).<br> ><br> > I documented the research here<br> ><br> > <a href="https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md" target="_blank">https://github.com/jclark/lang-ietf-opentype/blob/master/doc/notes.md</a><br> ><br> > As a result I have a lot of comments about HarfBuzz's implementation.<br> ><br> > First some stuff that is just typos.<br> ><br> > "ber" should be mapped to BBR not BER.<br> ><br> > There's a duplicate entry for "hz" not in sort order.<br> ><br> > The entries for "sck", "vls", "wo" are not in sort order.<br> ><br> > The tag for "tmh" is in lower case instead of upper case.<br> ><br> > Some tags are missing a final zero. The ISO WD adds some 4-character tags,<br> > whose last character is a zero. There are five cases where these have been<br> > added, but the final zero was incorrectly omitted: kab -> KAB0, ksh -> KSH0,<br> > kg -> KON0, pap -> PAP0, sn -> SNA0.<br> ><br> > The following entries appear in the spec, but are missing from HarfBuzz, and<br> > they seem uncontroversial to me.<br> ><br> > wlc CMR Mwali Comorian<br> > wni CMR Ndzwani Comorian<br> > zdj CMR Ngazidja Comorian<br> > caf CRR Southern Carrier<br> > co COS Corsican<br> ><br> > The last is probably missing because it was omitted from the ISO WD; I suspect<br> > this is a bug in the ISO WD.<br> ><br> > HarfBuzz (and the OT spec) are inconsistent in their handling of<br> > macrolanguages. 
Sometimes when an IETF macrolanguage is mapped to an OT lang,<br> > they also map the individual languages encompassed by the macrolanguage to<br> > that OT tag and sometimes they don't. I would suggest that the consistent and<br> > reasonable policy is always to map the individual languages to the same OT tag<br> > as the macrolanguage, unless the individual language is separately mapped to a<br> > more specific OT tag. I created a file with the additional entries that would<br> > be needed to implement this policy in HarfBuzz:<br> ><br> > <a href="https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt" target="_blank">https://github.com/jclark/lang-ietf-opentype/blob/master/gen/hb-macrolang-expand.txt</a><br> ><br> > The rest of my comments are not self-evident. You will need to refer to the<br> > notes I linked to above for my reasoning.<br> ><br> > My first set of removal/additions is in accordance with the ISO 639 codes in<br> > the spec. I suggest removing these mappings:<br> ><br> > eot BTI Beti (Côte d'Ivoire)<br> > kvd KUI Kui (Indonesia)<br> > mdc MLE Male (Papua New Guinea)<br> > mlq MNK Western Maninkakan<br> > nco SIB Sibe<br> > ril RIA Riang (India)<br> > xom KMO Komo (Sudan)<br> > yso NIS Nisi (China)<br> ><br> > and adding these:<br> ><br> > sjo SIB Xibe<br> > pro PRO Old Provencal<br> > rmz ARK Marma<br> ><br> > The next set is not in the spec. 
Remove:<br> ><br> > xst SIG (not an IETF tag, was Silt'e in ISO 639-2 before it was retired)<br> ><br> > and add:<br> ><br> > njz NIS Nyishi<br> > tgj NIS Tagin<br> > beb BTI Bebele<br> > bum BTI Bulu (Cameroon)<br> > bxp BTI Bebil<br> > eto BTI Eton (Cameroon)<br> > ewo BTI Ewondo<br> > fan BTI Fang (Equatorial Guinea)<br> > mct BTI Mengisa<br> ><br> > Finally I have suggestions for the commented-out entries in the source:<br> ><br> </div></div>> /*{"ahg/awn/xan?",HB_TAG('A','G','W',' ')},*//* Agaw */<br> ><br> > "ahg", "awn"<br> ><br> > /*{"gsw?/gsw-FR?",HB_TAG('A','L','S',' ')},*//* Alsatian */<br> ><br> > "gsw"<br> ><br> > /*{"krc",HB_TAG('B','A','L',' ')},*//* Balkar */<br> ><br> > Leave unmapped<br> ><br> > /*{"??",HB_TAG('B','C','R',' ')},*//* Bible Cree */<br> ><br> > Leave unmapped<br> ><br> > /*{"zh?",HB_TAG('C','H','N',' ')},*//* Chinese (seen in Microsoft fonts) */<br> ><br> > ???<br> ><br> > /*{"acf/gcf?",HB_TAG('F','A','N',' ')},*//* French Antillean */<br> ><br> > "acf", "gcf"<br> ><br> > /*{"enf?/yrk?",HB_TAG('F','N','E',' ')},*//* Forest Nenets */<br> ><br> > Leave unmapped<br> ><br> > /*{"fuf?",HB_TAG('F','T','A',' ')},*//* Futa */<br> ><br> > "fuf"<br> ><br> > /*{"ar-Syrc?",HB_TAG('G','A','R',' ')},*//* Garshuni */<br> ><br> > "ar-Syrc"<br> ><br> > /*{"cfm/rnl?",HB_TAG('H','A','L',' ')},*//* Halam */<br> ><br> > "cfm"<br> ><br> > /*{"fonipa",HB_TAG('I','P','P','H')},*//* Phonetic transcription—IPA<br> <div class="">> conventions */<br> ><br> > "und-fonipa", or better map anything with a variant of "fonipa"<br> ><br> </div>> /*{"ga-Latg?/Latg?",HB_TAG('I','R','T',' ')},*//* Irish Traditional */<br> ><br> > "ga-Latg"<br> ><br> > /*{"krc",HB_TAG('K','A','R',' ')},*//* Karachay */<br> ><br> > "krc"<br> ><br> > /*{"alw?/ktb?",HB_TAG('K','E','B',' ')},*//* Kebena */<br> ><br> > "alw"<br> ><br> > /*{"Geok",HB_TAG('K','G','E',' ')},*//* Khutsuri Georgian */<br> <div class="">><br> > "ka-Geok" (Georgian written with the Khutsuri script)<br> ><br> </div>> 
/*{"kca",HB_TAG('K','H','K',' ')},*//* Khanty-Kazim */<br> ><br> > "kca"<br> ><br> > /*{"kca",HB_TAG('K','H','S',' ')},*//* Khanty-Shurishkar */<br> ><br> > Leave unmapped<br> ><br> > /*{"kca",HB_TAG('K','H','V',' ')},*//* Khanty-Vakhi */<br> ><br> > Leave unmapped<br> ><br> > /*{"guz?/kqs?/kss?",HB_TAG('K','I','S',' ')},*//* Kisii */<br> ><br> > "guz"<br> ><br> > /*{"kfa/kfi?/kpb?/xua?/xuj?",HB_TAG('K','O','D',' ')},*//* Kodagu */<br> ><br> > "kfa"<br> ><br> > /*{"okm?/oko?",HB_TAG('K','O','H',' ')},*//* Korean Old Hangul */<br> ><br> > "okm"<br> ><br> > /*{"kon?/ktu?/...",HB_TAG('K','O','N',' ')},*//* Kikongo */<br> ><br> > "ktu"<br> ><br> > /*{"kfx?",HB_TAG('K','U','L',' ')},*//* Kulvi */<br> ><br> > "kfx"<br> ><br> > /*{"??",HB_TAG('L','A','H',' ')},*//* Lahuli */<br> <div class="">><br> > "lbf", "lae", "bfu"<br> ><br> </div>> /*{"??",HB_TAG('L','C','R',' ')},*//* L-Cree */<br> ><br> > Leave unmapped<br> ><br> > /*{"??",HB_TAG('M','A','L',' ')},*//* Malayalam Traditional */<br> ><br> > Leave unmapped<br> ><br> > /*{"mnk?/mlq?/...",HB_TAG('M','L','N',' ')},*//* Malinke */<br> ><br> > "mlq"<br> ><br> > /*{"??",HB_TAG('N','C','R',' ')},*//* N-Cree */<br> ><br> > "csw"<br> ><br> > /*{"??",HB_TAG('N','H','C',' ')},*//* Norway House Cree */<br> ><br> > Leave unmapped<br> ><br> > /*{"jpa?/sam?",HB_TAG('P','A','A',' ')},*//* Palestinian Aramaic */<br> ><br> > "jpa", "sam"<br> ><br> > /*{"polyton",HB_TAG('P','G','R',' ')},*//* Polytonic Greek */<br> ><br> > "el-polyton"<br> ><br> > /*{"??",HB_TAG('Q','I','N',' ')},*//* Asho Chin */<br> <div class="">><br> > "tbq"<br> ><br> > (The spec says Chin not Asho Chin.)<br> ><br> </div>> /*{"??",HB_TAG('R','C','R',' ')},*//* R-Cree */<br> ><br> > "atj"<br> ><br> > /*{"chp?",HB_TAG('S','A','Y',' ')},*//* Sayisi */<br> ><br> > Leave unmapped<br> ><br> > /*{"xan?",HB_TAG('S','E','K',' ')},*//* Sekota */<br> ><br> > "xan"<br> ><br> > /*{"ngo?",HB_TAG('S','X','T',' ')},*//* Sutu */<br> ><br> > Leave unmapped<br> ><br> > 
/*{"??",HB_TAG('T','C','R',' ')},*//* TH-Cree */<br> ><br> > Leave unmapped<br> ><br> > /*{"tnz?/tog?/toi?",HB_TAG('T','N','G',' ')},*//* Tonga */<br> ><br> > "toi"<br> ><br> > /*{"enh?/yrk?",HB_TAG('T','N','E',' ')},*//* Tundra Nenets */<br> ><br> > "yrk"<br> ><br> > /*{"??",HB_TAG('W','C','R',' ')},*//* West-Cree */<br> ><br> > Leave unmapped<br> ><br> > /*{"cre?",HB_TAG('Y','C','R',' ')},*//* Y-Cree */<br> ><br> > "crk"<br> ><br> > /*{"??",HB_TAG('Y','I','C',' ')},*//* Yi Classic */<br> ><br> > Leave unmapped<br> ><br> > /*{"ii?/Yiii?",HB_TAG('Y','I','M',' ')},*//* Yi Modern */<br> <div class="">><br> > "ii"<br> ><br> > It would also be desirable to map otherwise unmapped languages in the<br> > Yi script (i.e. with a script code of Yiii) to YIM.<br> ><br> </div>> /*{"??",HB_TAG('Z','H','P',' ')},*//* Chinese Phonetic */<br> <div class="">><br> > "zh-Latn"<br> ><br> > I'll have some more general comments later.<br> ><br> > James<br> ><br> ><br> ><br> </div>> _______________________________________________<br> > HarfBuzz mailing list<br> > <a href="mailto:HarfBuzz@lists.freedesktop.org">HarfBuzz@lists.freedesktop.org</a><br> > <a href="http://lists.freedesktop.org/mailman/listinfo/harfbuzz" target="_blank">http://lists.freedesktop.org/mailman/listinfo/harfbuzz</a><br> ><br> <span class="HOEnZb"><font color="#888888"><br> --<br> behdad<br> <a href="http://behdad.org/" target="_blank">http://behdad.org/</a><br> </font></span></blockquote></div><br></div>
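<div>P.S. To make point 1 above a bit more concrete, here is a minimal sketch of how a lookup over the two-level tables might work. It rests on several assumptions that are not part of HarfBuzz: ot_tag_t and OT_TAG stand in for hb_tag_t and HB_TAG, ot_tag_from_language and has_subtag are hypothetical names, the input tag is assumed to be already lowercased, the subtag buffer is widened to 9 bytes so that 8-character subtags stay NUL-terminated, and the table holds only a few entries (a real one would be sorted and searched with bsearch). Private-use and extension subtags are not treated specially.</div>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t ot_tag_t; /* stand-in for hb_tag_t */
#define OT_TAG(a, b, c, d)                                   \
  ((ot_tag_t)(((uint32_t)(unsigned char)(a) << 24) |         \
              ((uint32_t)(unsigned char)(b) << 16) |         \
              ((uint32_t)(unsigned char)(c) << 8) |          \
              ((uint32_t)(unsigned char)(d))))
#define OT_TAG_NONE ((ot_tag_t)0)

typedef struct {
  char subtag[9];  /* up to 8 characters plus NUL; "" ends the list */
  ot_tag_t ot_tag;
} LangSubtag;

typedef struct {
  ot_tag_t initial_tag;       /* primary subtag, space padded */
  ot_tag_t ot_tag;            /* default when no subtag entry matches */
  const LangSubtag *subtags;  /* NULL for most languages */
} LangInitialTag;

/* Earlier entries win, so "hk" takes priority over "hant". */
static const LangSubtag zh_lang[] = {
  {"hk",   OT_TAG('Z','H','H',' ')},
  {"hant", OT_TAG('Z','H','T',' ')},
  {"", 0}
};

static const LangSubtag el_lang[] = {
  {"polyton", OT_TAG('P','G','R',' ')},
  {"", 0}
};

static const LangInitialTag ot_languages[] = {
  {OT_TAG('e','l',' ',' '), OT_TAG('E','L','L',' '), el_lang},
  {OT_TAG('e','n',' ',' '), OT_TAG('E','N','G',' '), NULL},
  {OT_TAG('z','h',' ',' '), OT_TAG('Z','H','S',' '), zh_lang},
};

/* rest points just past the primary subtag: "" or "-hant-hk" etc. */
static int
has_subtag(const char *rest, const char *subtag)
{
  size_t want = strlen(subtag);
  while (*rest == '-') {
    rest++;
    size_t len = strcspn(rest, "-");
    if (len == want && memcmp(rest, subtag, len) == 0)
      return 1;
    rest += len;
  }
  return 0;
}

static ot_tag_t
ot_tag_from_language(const char *lang) /* lowercase, e.g. "zh-hant-hk" */
{
  size_t n = strcspn(lang, "-");
  if (n == 0 || n > 3)
    return OT_TAG_NONE;
  char p[4] = {' ', ' ', ' ', ' '};
  memcpy(p, lang, n);
  ot_tag_t initial = OT_TAG(p[0], p[1], p[2], p[3]);

  for (size_t i = 0; i < sizeof ot_languages / sizeof ot_languages[0]; i++) {
    const LangInitialTag *e = &ot_languages[i];
    if (e->initial_tag != initial)
      continue;
    if (e->subtags)
      for (const LangSubtag *s = e->subtags; s->subtag[0]; s++)
        if (has_subtag(lang + n, s->subtag))
          return s->ot_tag;
    return e->ot_tag;
  }
  return OT_TAG_NONE;
}
```

<div>With this layout the priority between subtags is expressed purely by the order of the entries in each subtag list, which keeps the per-language logic in data. The "fonipa" case would still need separate handling in code, as noted in point 1.</div>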