[HarfBuzz] SEA Syllable Structure and Tai Tham Combining Classes

Mon Jun 10 16:31:06 PDT 2013

Dear List,

HarfBuzz (at least as of 0.9.18) SEA shaping for Tai Tham currently falls
foul of the pointless assignment of ccc=230 to the Tai Tham tone marks.
Because U+1A60 TAI THAM SIGN SAKOT has ccc=9, canonical ordering places
it before any adjacent tone mark.  The first result is that the sequence
<tone-mark, sakot, consonant>, which belongs to the syllable tail
according to hb-ot-shape-complex-sea-machine.rl, is normalised to
<sakot, tone-mark, consonant>, which does not.  The consequence is that
the word <HIGH KA, SIGN I, TONE-1, SAKOT, NGA>, which does not need any
reordering, gets reordered to <HIGH KA, SIGN I, SAKOT, *COMBINING
CIRCLE*, TONE-1, NGA>.

This problem does not show up when the recommended character order is
used with Unicode data older than 5.2 (October 2009).
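
To make the reordering concrete, here is a minimal C++ sketch of
canonical reordering applied to the example word.  It is not HarfBuzz
code: ccc() and canonical_reorder() are names invented for this
illustration, and the class table covers only the code points in the
example (everything else is assumed to be ccc=0).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Canonical combining classes for the code points in the example word.
// Only SAKOT and TONE-1 are non-zero here; all else is treated as ccc=0.
static unsigned ccc(char32_t cp)
{
  switch (cp) {
    case 0x1A60: return 9;    // TAI THAM SIGN SAKOT
    case 0x1A75: return 230;  // TAI THAM SIGN TONE-1
    default:     return 0;    // letters and vowel signs used here
  }
}

// Canonical reordering: stable-sort each maximal run of characters
// with non-zero combining class by ascending class.
static std::vector<char32_t> canonical_reorder(std::vector<char32_t> s)
{
  std::size_t i = 0;
  while (i < s.size()) {
    if (ccc(s[i]) == 0) { ++i; continue; }
    std::size_t j = i;
    while (j < s.size() && ccc(s[j]) != 0) ++j;
    std::stable_sort(s.begin() + i, s.begin() + j,
                     [](char32_t a, char32_t b) { return ccc(a) < ccc(b); });
    i = j;
  }
  return s;
}
```

Feeding it <U+1A20, U+1A65, U+1A75, U+1A60, U+1A26> (HIGH KA, SIGN I,
TONE-1, SAKOT, NGA) moves SAKOT in front of TONE-1, which is the
sequence the syllable machine then fails to recognise.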

I know of two possible solutions; I have tried both out.

Solution 1:

The first solution is to change the S.E. Asian syllable
structure in hb-ot-shape-complex-sea-machine.rl from having

    syllable_tail = (VPre|VAbv|VBlw|VPst|H.C|CM|MR|T|A)*;

to having

    syllable_tail = (VPre|VAbv|VBlw|VPst|H.C|H.T.C|H.T.T.C|CM|MR|T|A)*;

Two tones can occur in succession when a word with chained syllables is
encoded according to its appearance rather than its morphology.
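
The effect of the extra alternatives can be checked with a rough C++
sketch that models only the syllable_tail rule as a regular expression.
The one-letter encoding of the categories, and the names old_tail,
new_tail and is_tail(), are made up for this illustration; the real
Ragel machine has more rules than this.

```cpp
#include <cassert>
#include <regex>
#include <string>

// One letter per shaper category (an encoding invented for this sketch):
//   p=VPre a=VAbv b=VBlw s=VPst H=H C=C M=CM R=MR T=T A=A
static const std::regex old_tail("(?:p|a|b|s|HC|M|R|T|A)*");
static const std::regex new_tail("(?:p|a|b|s|HC|HTC|HTTC|M|R|T|A)*");

// True if the whole category string is accepted as a syllable tail.
static bool is_tail(const std::regex &tail, const std::string &cats)
{
  return std::regex_match(cats, tail);
}
```

Here is_tail(old_tail, "THC") holds but is_tail(old_tail, "HTC") does
not, whereas new_tail accepts both "HTC" and "HTTC".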

This still leaves the need for complex look-ups to ligate the sakot and
consonant in glyph sequences such as <sakot, tone mark, consonant>,
preferably yielding <tone mark, sakot+consonant>.

Anyone trying out this solution should ensure that they have Ragel
available to translate hb-ot-shape-complex-sea-machine.rl to
hb-ot-shape-complex-sea-machine.hh.

Solution 2:

The second solution is to change the canonical combining classes
internally so that the Tai Tham tone marks (ccc=230) are ordered before
U+1A60 TAI THAM SIGN SAKOT (ccc=9).  I did this by making SAKOT borrow
the class of U+0345 COMBINING GREEK YPOGEGRAMMENI (ccc=240), changing
the function modified_combining_class() in hb-unicode-private.hh from

  unsigned int
  modified_combining_class (hb_codepoint_t unicode)
  {
    /* XXX This hack belongs to the Myanmar shaper. */
    if (unicode == 0x1037) unicode = 0x103A;

    return _hb_modified_combining_class[combining_class (unicode)];
  }

to

  unsigned int
  modified_combining_class (hb_codepoint_t unicode)
  {
    /* XXX This hack belongs to the Myanmar shaper. */
    if (unicode == 0x1037) unicode = 0x103A;
    /* XXX This hack belongs to the SEA shaper. */
    if (unicode == 0x1a60) unicode = 0x0345;

    return _hb_modified_combining_class[combining_class (unicode)];
  }
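
As a rough check of what the remapping does to mark order, here is a
self-contained C++ sketch.  It only imitates the patched function:
combining_class() is a hand-written table for the few code points
involved, the _hb_modified_combining_class[] lookup is left out, and
reorder_marks() is a name invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Canonical combining classes for the few code points used here;
// everything else is treated as ccc=0 in this sketch.
static unsigned combining_class(char32_t cp)
{
  switch (cp) {
    case 0x0345: return 240;  // COMBINING GREEK YPOGEGRAMMENI
    case 0x1A60: return 9;    // TAI THAM SIGN SAKOT
    case 0x1A75: return 230;  // TAI THAM SIGN TONE-1
    default:     return 0;
  }
}

// Stand-in for the patched function: SAKOT borrows the class of U+0345.
static unsigned modified_combining_class(char32_t cp)
{
  if (cp == 0x1A60) cp = 0x0345;
  return combining_class(cp);
}

// Stable-sort each run of marks by their modified combining class.
static std::vector<char32_t> reorder_marks(std::vector<char32_t> s)
{
  std::size_t i = 0;
  while (i < s.size()) {
    if (modified_combining_class(s[i]) == 0) { ++i; continue; }
    std::size_t j = i;
    while (j < s.size() && modified_combining_class(s[j]) != 0) ++j;
    std::stable_sort(s.begin() + i, s.begin() + j,
                     [](char32_t a, char32_t b) {
                       return modified_combining_class(a) <
                              modified_combining_class(b);
                     });
    i = j;
  }
  return s;
}
```

With the remapping, <HIGH KA, SIGN I, TONE-1, SAKOT, NGA> is left
alone, and the swapped input <HIGH KA, SIGN I, SAKOT, TONE-1, NGA> is
brought back to that same order.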

Another option is to map canonical class 9 (the virama class) as a
whole to something high.  However, that might have undesirable effects
on any Thai script fonts that enlarge U+0E3A THAI CHARACTER PHINTHU
when it is used as a vowel diacritic in minority languages.  Martin
Hosken might be able to advise.

With Solution 2, SAKOT ends up next to its following consonant, which
avoids the complex look-ups that Solution 1 still needs.  I therefore
recommend Solution 2.

Richard.