[HarfBuzz] SEA Syllable Structure and Tai Tham Combining Classes
Richard Wordingham
richard.wordingham at ntlworld.com
Mon Jun 10 16:31:06 PDT 2013
Dear List,
HarfBuzz (at least as of 0.9.18) SEA shaping for Tai Tham currently falls
foul of the pointless assignment of ccc=230 to the Tai Tham
tone marks. The first result is that the sequence <tone-mark, sakot,
consonant>, which belongs to the syllable tail according to
hb-ot-shape-complex-sea-machine.rl, is normalised to <sakot, tone-mark,
consonant>, which does not. The consequence is that the word
<HIGH KA, SIGN I, TONE-1, SAKOT, NGA>, which does not need any reordering,
gets reordered to <HIGH KA, SIGN I, SAKOT, *COMBINING CIRCLE*, TONE-1,
NGA>.
This problem does not show up when Unicode data older than 5.2
(October 2009) is used together with the recommended character order.
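
The normalisation step described above can be reproduced with a small stand-alone sketch. This is not HarfBuzz code: ccc() and canonical_order() are hypothetical helpers, and the combining classes are hard-coded for the code points of the example word only (U+1A60 SAKOT as ccc=9, U+1A75 TONE-1 as ccc=230, everything else treated as ccc=0). It shows only the canonical reordering, not the later insertion of the dotted circle by the shaper.

```cpp
#include <algorithm>
#include <vector>

// Combining classes for the example word only (assumed values, hard-coded
// for illustration); every other code point is treated as a starter (ccc=0).
static unsigned ccc (char32_t u)
{
  switch (u) {
    case 0x1A60: return 9;    // TAI THAM SIGN SAKOT
    case 0x1A75: return 230;  // TAI THAM SIGN TONE-1
    default:     return 0;    // HIGH KA, NGA, VOWEL SIGN I, ...
  }
}

// Canonical ordering: stable-sort each maximal run of non-starters by ccc.
static std::vector<char32_t> canonical_order (std::vector<char32_t> s)
{
  auto is_starter = [] (char32_t u) { return ccc (u) == 0; };
  auto it = s.begin ();
  while (it != s.end ()) {
    if (is_starter (*it)) { ++it; continue; }
    auto run_end = std::find_if (it, s.end (), is_starter);
    std::stable_sort (it, run_end,
                      [] (char32_t a, char32_t b) { return ccc (a) < ccc (b); });
    it = run_end;
  }
  return s;
}
```

Feeding it <U+1A20, U+1A65, U+1A75, U+1A60, U+1A26> (HIGH KA, SIGN I, TONE-1, SAKOT, NGA) moves SAKOT in front of TONE-1, which is the order the syllable machine then fails to accept as a tail.
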
I know of two possible solutions; I have tried both out.
Solution 1:
The first solution is to change the S.E. Asian syllable
structure in hb-ot-shape-complex-sea-machine.rl from having
  syllable_tail = (VPre|VAbv|VBlw|VPst|H.C|CM|MR|T|A)*;
to having
  syllable_tail = (VPre|VAbv|VBlw|VPst|H.C|H.T.C|H.T.T.C|CM|MR|T|A)*;
Two tones can occur in succession when a word with chained syllables is
encoded according to its appearance rather than its morphology.
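
To make the difference between the two rules concrete, here is a rough hand-written matcher for the tail expression. It is only a sketch, not what Ragel generates: the Cat enum and matches_tail() are hypothetical names, the categories are taken from the rule as quoted, and the function tests a whole sequence rather than finding the longest matching prefix as the real machine does.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical category enum, named after the symbols in the quoted rule.
enum class Cat { VPre, VAbv, VBlw, VPst, H, C, CM, MR, T, A };

// Returns true if the whole sequence matches syllable_tail.
// With allow_tones_after_h == false this models the original rule
// (H must be followed directly by C); with true it models the proposed
// rule, which also accepts H.T.C and H.T.T.C.
static bool matches_tail (const std::vector<Cat> &seq, bool allow_tones_after_h)
{
  const std::size_t max_tones = allow_tones_after_h ? 2 : 0;
  std::size_t i = 0;
  while (i < seq.size ()) {
    if (seq[i] == Cat::C) return false;       // a bare C is not a tail element
    if (seq[i] != Cat::H) { ++i; continue; }  // single-category alternative
    // H: skip up to max_tones tone marks, then require a consonant.
    std::size_t j = i + 1, tones = 0;
    while (j < seq.size () && seq[j] == Cat::T && tones < max_tones) { ++j; ++tones; }
    if (j >= seq.size () || seq[j] != Cat::C) return false;
    i = j + 1;
  }
  return true;
}
```

With these assumptions, <T, H, C> is a tail under both rules, while the normalised order <H, T, C> is a tail only under the proposed rule.
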
This still leaves the need for complex look-ups to ligate the sakot and
consonant in a glyph sequence such as <sakot, tone mark, consonant>,
preferably producing <tone mark, sakot+consonant>.
Anyone trying out this solution should ensure that they have Ragel
available to translate hb-ot-shape-complex-sea-machine.rl to
hb-ot-shape-complex-sea-machine.hh.
Solution 2:
The second solution is to internally change the canonical classes so
that the Tai Tham tone-marks (ccc=230) are ordered before U+1A60 TAI
THAM SIGN SAKOT (ccc=9). The way I did this was to change the
function modified_combining_class() in hb-unicode-private.hh from
  unsigned int
  modified_combining_class (hb_codepoint_t unicode)
  {
    /* XXX This hack belongs to the Myanmar shaper. */
    if (unicode == 0x1037) unicode = 0x103A;
    return _hb_modified_combining_class[combining_class (unicode)];
  }
to
  unsigned int
  modified_combining_class (hb_codepoint_t unicode)
  {
    /* XXX This hack belongs to the Myanmar shaper. */
    if (unicode == 0x1037) unicode = 0x103A;
    /* XXX This hack belongs to the SEA shaper. */
    /* U+0345 has ccc=240, so SAKOT sorts after the tone marks (ccc=230). */
    if (unicode == 0x1a60) unicode = 0x0345;
    return _hb_modified_combining_class[combining_class (unicode)];
  }
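
The effect of this remapping on mark ordering can be checked with a stand-alone sketch. It is only an approximation of what HarfBuzz does: base_ccc(), patched_ccc() and reorder_marks() are hypothetical helpers, the combining classes are hard-coded for a handful of code points, and the _hb_modified_combining_class table is assumed to leave the classes involved unchanged.

```cpp
#include <algorithm>
#include <vector>

// Combining classes for the code points used here (assumed values,
// hard-coded for illustration); everything else is treated as ccc=0.
static unsigned base_ccc (char32_t u)
{
  switch (u) {
    case 0x0345: return 240;  // COMBINING GREEK YPOGEGRAMMENI
    case 0x1A60: return 9;    // TAI THAM SIGN SAKOT
    case 0x1A75: return 230;  // TAI THAM SIGN TONE-1
    default:     return 0;
  }
}

// Mirrors the patched lookup: SAKOT borrows the class of U+0345.
// Assumption: the extra table indirection is the identity for these classes.
static unsigned patched_ccc (char32_t u)
{
  if (u == 0x1A60) u = 0x0345;
  return base_ccc (u);
}

// Stable-sort each maximal run of non-starters by the patched class.
static std::vector<char32_t> reorder_marks (std::vector<char32_t> s)
{
  auto is_starter = [] (char32_t u) { return patched_ccc (u) == 0; };
  auto it = s.begin ();
  while (it != s.end ()) {
    if (is_starter (*it)) { ++it; continue; }
    auto run_end = std::find_if (it, s.end (), is_starter);
    std::stable_sort (it, run_end, [] (char32_t a, char32_t b) {
      return patched_ccc (a) < patched_ccc (b);
    });
    it = run_end;
  }
  return s;
}
```

Under these assumptions <HIGH KA, SIGN I, TONE-1, SAKOT, NGA> is left untouched, and a sequence typed as <SAKOT, TONE-1, consonant> comes out as <TONE-1, SAKOT, consonant>.
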
Another option would be to change canonical class 9 (the virama
class) as a whole to something high. However, that might have
undesirable effects on any Thai script fonts that enlarge U+0E3A THAI
CHARACTER PHINTHU when it is used as a vowel diacritic in minority
languages. Martin Hosken might be able to advise.
With Solution 2, SAKOT will end up next to its following consonant, and
I therefore recommend Solution 2.
Richard.