[HarfBuzz] Dotted Circles in Tai Tham

Sat Jan 31 17:39:42 PST 2015

I've been having some problems with spurious dotted circles in various
versions of HarfBuzz, and I thought I would share before proposing a
complete solution to Behdad.  I've been looking at 3 versions of
HarfBuzz:

'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz is
in that version of LibreOffice.  I know it's old, because
its normalisation orders U+1A60 SAKOT before the tone
marks.  I have lookups in place to ameliorate that problem.

'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.

'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable
Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post
'Indic Syllable Categories'
http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html). These
categories are defined in HarfBuzz by file
hb-ot-shape-complex-indic-table.cc.  I was about to formally submit my
suggestions to the Unicode Technical Committee, but then I discovered
that the changes would adversely affect HarfBuzz.
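
The reordering mentioned for the LibreOffice 4.3.4 version is what plain
canonical ordering produces, because SAKOT has canonical combining
class 9 and the tone marks have class 230.  A minimal sketch, using
Python's standard unicodedata module rather than HarfBuzz itself; the
input sequence is a made-up test string, and nothing here shows how
that version of HarfBuzz implements its normalisation:

```python
import unicodedata

HIGH_KA = "\u1A20"  # TAI THAM LETTER HIGH KA
NA = "\u1A36"       # TAI THAM LETTER NA
SAKOT = "\u1A60"    # TAI THAM SIGN SAKOT
TONE_1 = "\u1A75"   # TAI THAM SIGN TONE-1

def show(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Hypothetical input with the tone mark placed before SAKOT.
typed = HIGH_KA + TONE_1 + SAKOT + NA

# Canonical ordering sorts adjacent marks by combining class.
normalised = unicodedata.normalize("NFD", typed)

print("ccc(SAKOT)  =", unicodedata.combining(SAKOT))
print("ccc(TONE-1) =", unicodedata.combining(TONE_1))
print("typed      :", show(typed))
print("normalised :", show(normalised))
```

After normalisation SAKOT precedes the tone mark, so a font has to
cope with that order if it is to build the subscript form.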

The first problem arose with U+1A7B MAI SAM.  While there
is no problem with its uses to indicate word (or phrase) repetition by
marking the last akshara and to indicate the merger of two 1-consonant
vowelless consonant stacks, a dotted circle occurs in the example
/thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA, U+1A7B MAI
SAM, U+1A6B SIGN O, U+1A41 RA>.  The problem is that MAI SAM has an ISC
of 'other', so U+25CC is inserted before SIGN O.  Making MAI SAM a
'dependent vowel' as I had suggested fixed this problem.
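
The effect of the reclassification can be pictured with a toy
clustering routine.  This is only a sketch under loud assumptions: the
category table, the function clusters() and its rules are hypothetical
simplifications, not HarfBuzz's shaper.  The one behaviour taken from
the report is that a character of category 'other' ends the cluster,
so a following dependent vowel has no base and gets U+25CC.

```python
DOTTED_CIRCLE = 0x25CC

# Hypothetical, simplified categories for the code points of /thanon/.
ISC = {
    0x1A33: "consonant",        # HIGH THA
    0x1A60: "stacker",          # SAKOT
    0x1A36: "consonant",        # NA
    0x1A7B: "other",            # MAI SAM (current classification)
    0x1A6B: "dependent vowel",  # SIGN O
    0x1A41: "consonant",        # RA
}

def clusters(cps, isc):
    """Toy model: group code points, adding U+25CC before orphaned marks."""
    out = []
    open_cluster = False   # can the last cluster still take marks?
    after_stacker = False  # was the previous character a stacker?
    for cp in cps:
        cat = isc[cp]
        if cat == "consonant" and after_stacker:
            out[-1].append(cp)        # subjoined consonant
        elif cat == "consonant":
            out.append([cp])          # new base
            open_cluster = True
        elif cat == "other":
            out.append([cp])          # stands alone and closes the cluster
            open_cluster = False
        else:                         # stacker, dependent vowel
            if not open_cluster:
                out.append([DOTTED_CIRCLE])
                open_cluster = True
            out[-1].append(cp)
        after_stacker = (cat == "stacker")
    return out

thanon = [0x1A33, 0x1A60, 0x1A36, 0x1A7B, 0x1A6B, 0x1A41]

as_other = clusters(thanon, ISC)
as_vowel = clusters(thanon, {**ISC, 0x1A7B: "dependent vowel"})

for label, result in (("other", as_other), ("dependent vowel", as_vowel)):
    print(label, [[f"U+{cp:04X}" for cp in c] for c in result])
```

With MAI SAM as 'other' the toy model puts U+25CC in front of SIGN O;
with MAI SAM as 'dependent vowel' no dotted circle appears.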

The second problem arose with U+1A7A RA HAAM, and could also arise with
U+1A7C KARAN.  The problem is that with the influx of foreign loans
into Thai, in Thailand there are now clusters of two consonants in which
the *first* consonant is silent.  In most cases, there is no
way for Tai Tham to show which is silent, but when the tail of the
second consonant rises to the hanging baseline, the placement of the
cancellation marks tends to show which consonant is cancelled.  A
(hypothetical) example is the English surname 'Dawes', which is
represented with three consonants in Thai.  The transliteration of 'w'
is marked as silent.  Conversely, 'Howes' would be written with the
transliteration of the 's' as silent.  This prevents the font
deciding the placement of the cancellation mark on a cluster by cluster
basis.  Following the lead of Thai, 'Dawes' would be written <U+1A2F DA,
U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM, U+1A60 SAKOT, U+1A48
HIGH SA>.

LibreOffice 4.3.4 splits the cluster into three syllables, <WA, SAKOT>,
<RA HAAM> and <HIGH SA>, and the problem is simply that the subscript
form cannot be generated until after the syllable boundaries are
dropped.  This is simply a variant of the tone and SAKOT problem, which
the font can solve and which has been eliminated in later versions.

HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>,
<RA HAAM>, <U+25CC, SAKOT, HIGH SA> because RA HAAM has an ISC of
'other'.  New ISC marks RA HAAM as a 'pure killer'.  Unfortunately,
this does not change the misdeduced syllable structure.  I think the
analysis needs to treat the sequence 'pure killer', 'invisible stacker'
as being within a single syllable.  Is this too much to ask for?
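
The request can be pictured as one extra optional element in a syllable
pattern.  The following is a rough sketch only: the one-letter class
codes and both regular expressions are hypothetical stand-ins, not
HarfBuzz's actual syllable grammar.  A syllable that starts with 'S'
has no base and would receive U+25CC.

```python
import re

# Hypothetical one-letter class codes:
#   C = consonant, S = invisible stacker, K = pure killer, O = other
CURRENT = re.compile(r"S?C(?:SC)*|.")
# Proposed: a pure killer may sit inside the stack, before the stacker.
PROPOSED = re.compile(r"S?C(?:K?SC)*K?|.")

def syllables(classes, pattern):
    """Split a string of class codes into the syllables the pattern finds."""
    return pattern.findall(classes)

# <WA, RA HAAM, SAKOT, HIGH SA> under the two classifications of RA HAAM
ra_haam_other = "COSC"
ra_haam_killer = "CKSC"

print(syllables(ra_haam_other, CURRENT))
print(syllables(ra_haam_killer, CURRENT))
print(syllables(ra_haam_killer, PROPOSED))
```

In this toy model, reclassifying RA HAAM alone still gives three
syllables, the last one baseless; only the changed pattern keeps the
whole sequence together.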

The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC
DOT, and possibly is not a real problem.  I have too few examples of
the character's use.  CRYPTOGRAMMIC DOT currently has an ISC of 'other',
so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence <U+1A49
HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three
syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>.  It is
possible that the input sequence will not occur in the wild.  In 'New
ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the sequence
is treated as a single syllable, as desired.

The next problem was with the admittedly unusual writing <U+1A93 THAM
DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'.  None of the
three versions allowed the digit to be treated as a consonant base, and
so U+25CC was introduced before SAKOT.  Does the SEA engine need to be
specifically instructed to treat Tai Tham decimal numbers as potential
character bases?
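
One conceivable way to express such an instruction is to key it on the
General Category rather than on a list of characters.  A small sketch,
assuming a hypothetical predicate may_be_base() that is not part of
HarfBuzz; it only shows that the Tai Tham digits can be picked out by
their category 'Nd':

```python
import unicodedata

THAM_DIGIT_THREE = "\u1A93"  # TAI THAM THAM DIGIT THREE
SAKOT = "\u1A60"             # TAI THAM SIGN SAKOT

def may_be_base(ch, isc_is_consonant=False):
    """Hypothetical rule: consonants and decimal digits may act as bases."""
    return isc_is_consonant or unicodedata.category(ch) == "Nd"

# Hora digits (U+1A80..U+1A89) and Tham digits (U+1A90..U+1A99);
# the unassigned code points in between are filtered out.
tai_tham_digits = [chr(cp) for cp in range(0x1A80, 0x1A9A)
                   if unicodedata.category(chr(cp)) == "Nd"]

print(len(tai_tham_digits), "Tai Tham digits")
print("DIGIT THREE may be a base:", may_be_base(THAM_DIGIT_THREE))
print("SAKOT may be a base:", may_be_base(SAKOT))
```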

Some of my changes for 'New ISC' had bad consequences.  Changing
U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel
resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into
two syllables, <LOW CA, SAKOT> and <LAE>.  While the font can work
round this, this is not good.

Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to
'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75
TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC,
TONE-1>.  This seems odd;  U+0ECD LAO NIGGAHITA is classified by
Unicode as 'bindu', yet regularly has tone marks mounted on it.  Is the
syllable splitting here a HarfBuzz error?

Richard.