[HarfBuzz] Dotted Circles in Tai Tham

Mon Feb 23 16:31:54 PST 2015

On Sun, 1 Feb 2015 01:39:42 +0000
Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> I've been having some problems with spurious dotted circles in various
> versions of HarfBuzz, and I thought I would share before proposing a
> complete solution to Behdad.

Well, no-one has shown any interest, so I will go ahead with my
proposals/requests.  For ease of reference, I have deleted little from
my original post.

> I've been looking at 3 versions of HarfBuzz:

> 'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz
> is in that version of LibreOffice.

When checking the version later, I saw 'LibreOffice 4.3.3.2', so it's
possible LibreOffice 4.3.4 is different.

> 'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.

Some time on Saturday 31 January 2015 might be more precise.

> 'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable
> Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post
> 'Indic Syllable Categories'
> http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html).

> These categories are defined in HarfBuzz by file
> hb-ot-shape-complex-indic-table.cc.  I was about to formally submit my
> suggestions to the Unicode Technical Committee, but then I discovered
> that the changes would adversely affect HarfBuzz.

> The first problem arose with U+1A7B MAI SAM.  While there
> is no problem with its uses to indicate word (or phrase) repetition by
> marking the last akshara and to indicate the merger of two 1-consonant
> vowelless consonant stacks, a dotted circle occurs in the example
> /thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA, U+1A7B MAI
> SAM, U+1A6B SIGN O, U+1A41 RA>.  The problem is that MAI SAM has an
> ISC of 'other', so U+25CC is inserted before SIGN O.  Making MAI SAM a
> 'dependent vowel' as I had suggested fixed this problem.

> The second problem arose with U+1A7A RA HAAM, and could also arise
> with U+1A7C KARAN.  The problem is that with the influx of foreign
> loans into Thai, in Thailand there are now clusters of two consonants
> in which the *first* consonant is silent.  In most cases,
> there is no way for Tai Tham to show which is silent, but when the
> tail of the second consonant rises to the hanging baseline, the
> placement of the cancellation marks tends to show which consonant is
> cancelled.  A (hypothetical) example is the English surname 'Dawes',
> which is represented with three consonants in Thai.  The
> transliteration of 'w' is marked as silent.  Conversely, 'Howes'
> would be written with the transliteration of the 's' as silent.  This
> prevents the font deciding the placement of the cancellation mark on
> a cluster by cluster basis. Following the lead of Thai, this would be
> written <U+1A2F DA, U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM,
> U+1A60 SAKOT, U+1A48 HIGH SA>.

> LibreOffice 4.3.4 splits the cluster into three syllables, <WA,
> SAKOT>, <RA HAAM> and <HIGH SA>, and the problem is simply that the
> subscript form cannot be generated until after the syllable boundaries are
> dropped.  This is simply a variant of the font-soluble but for the
> future eliminated tone and SAKOT problem.
> 
> HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>,
> <RA HAAM>, <U+25CC, SAKOT, HIGH SA> because RA HAAM has an ISC of
> 'other'.  New ISC marks RA HAAM as a 'pure killer'.  Unfortunately,
> this does not change the misdeduced syllable structure.  I think the
> analysis needs to treat the sequence 'pure killer', 'invisible
> stacker' as being within a single syllable.  Is this too much to ask
> for?
> 
> The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC
> DOT, and possibly is not a real problem.  I have too few examples of
> the character's use.  CRYPTOGRAMMIC DOT currently has an ISC of
> 'other', so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence
> <U+1A49 HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three
> syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>.  It
> is possible that the input sequence will not occur in the wild.  In
> 'New ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the
> sequence is treated as a single syllable, as desired.

The first (MAI SAM) and third (CRYPTOGRAMMIC DOT) problems are solved
by recategorising the characters as interior members of the syllable,
with Indic Syllabic categories 'matra' and 'nukta' respectively.  I
recommend that HarfBuzz make these changes, in the file
hb-ot-shape-complex-indic-table.cc.

MAI SAM is an odd matra, as its placement is determined by its phonetic
role, not its visual position.  It is worth noting that this mark is a
superscript version of U+1A92 TAI THAM THAM DIGIT TWO, and that its
core meaning is that there are two of something, not just one of
something.  A classification as 'Consonant_medial' would work just as
well.
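
The effect of the two recategorisations can be illustrated with a toy
cluster builder in Python.  This is a sketch only, not HarfBuzz's
syllable machine: the four categories, the function `clusters` and the
tables CURRENT and PROPOSED are invented for illustration, and 'matra'
and 'nukta' are lumped together as a single interior 'mark' category.

```python
DOTTED_CIRCLE = 0x25CC

# Toy categories (assumptions for illustration, not HarfBuzz's own).
BASE, STACKER, MARK, OTHER = "base", "stacker", "mark", "other"

# Categories as described in the post for HarfBuzz 0.9.38+.
CURRENT = {
    0x1A33: BASE,     # HIGH THA
    0x1A36: BASE,     # NA
    0x1A41: BASE,     # RA
    0x1A49: BASE,     # HIGH HA
    0x1A60: STACKER,  # SAKOT
    0x1A63: MARK,     # SIGN AA
    0x1A6B: MARK,     # SIGN O
    0x1A7B: OTHER,    # MAI SAM
    0x1A7F: OTHER,    # CRYPTOGRAMMIC DOT
}

# Proposed: both become interior members of the syllable.
PROPOSED = dict(CURRENT)
PROPOSED[0x1A7B] = MARK  # 'matra' in the proposal
PROPOSED[0x1A7F] = MARK  # 'nukta' in the proposal


def clusters(codepoints, categories):
    """Split code points into clusters, inserting U+25CC before orphans."""
    out = []
    open_syllable = False  # True while marks may attach to out[-1]
    after_stacker = False  # True directly after SAKOT
    for cp in codepoints:
        cat = categories[cp]
        if cat == BASE:
            if after_stacker:
                out[-1].append(cp)  # subjoined consonant, same cluster
            else:
                out.append([cp])    # a new syllable starts here
            open_syllable = True
            after_stacker = False
        elif cat == OTHER:
            out.append([cp])        # stands alone and closes the syllable
            open_syllable = False
            after_stacker = False
        else:  # MARK or STACKER
            if not open_syllable:
                out.append([DOTTED_CIRCLE])  # nothing to attach to
                open_syllable = True
            out[-1].append(cp)
            after_stacker = (cat == STACKER)
    return out
```

With CURRENT, the /thanon/ sequence <U+1A33, U+1A60, U+1A36, U+1A7B,
U+1A6B, U+1A41> yields a cluster beginning with U+25CC before SIGN O;
with PROPOSED it yields two clusters and no dotted circle.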

For the second problem, an analogue can be found in the Khmer
sequences <U+179C KHMER LETTER VO, U+17CD KHMER SIGN TOANDAKHIAT,
U+17D2 KHMER SIGN COENG, U+179F KHMER LETTER SA> and its anagram
<U+179C, U+17D2, U+179F, U+17CD>, which render without complaint and
slightly differently in both HarfBuzz and Windows 7 (other versions not
tested).

Now, at present, U+17CD is classified as 'Vowel_Dependent' by an
explicit override in gen-indic-table.py, the generator of
hb-ot-shape-complex-indic-table.cc.  The same treatment would suffice
for U+1A7A RA HAAM and U+1A7C KARAN.
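
The kind of override meant here can be sketched as a small lookup
layered over the Unicode data.  This is a minimal Python illustration
and not the actual contents of gen-indic-table.py: the names OVERRIDES
and `shaping_category` are hypothetical, and the fallback to 'Other'
for unlisted characters is an assumption.

```python
# Explicit overrides take precedence over the Unicode-supplied category.
OVERRIDES = {
    0x17CD: "Vowel_Dependent",  # KHMER SIGN TOANDAKHIAT (existing, per the post)
    0x1A7A: "Vowel_Dependent",  # TAI THAM SIGN RA HAAM (proposed)
    0x1A7C: "Vowel_Dependent",  # TAI THAM SIGN KHUEN-LUE KARAN (proposed)
}


def shaping_category(codepoint, unicode_isc):
    """Return the category used for shaping.

    unicode_isc is a dict from code point to the Indic syllabic category
    taken from the Unicode data; anything missing is treated as 'Other'.
    """
    if codepoint in OVERRIDES:
        return OVERRIDES[codepoint]
    return unicode_isc.get(codepoint, "Other")
```

Characters without an override keep whatever category the Unicode data
gives them, so the change stays confined to the listed code points.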

> The next problem was with the admittedly unusual writing <U+1A93 THAM
> DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'.  None of the
> three versions allowed the digit to be treated as a consonant base,
> and so U+25CC was introduced before SAKOT.  Does the SEA engine need
> to be specifically instructed to treat Tai Tham decimal numbers as
> potential character bases?

The answer, I see, is that it does need to be so instructed.
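
A rough Python sketch of what such an instruction amounts to follows.
The function names and the `digits_are_bases` switch are hypothetical,
and the check is deliberately crude: it looks only at the character
immediately before SAKOT and treats every Tai Tham letter as a
potential base.

```python
import unicodedata

SAKOT = "\u1A60"


def is_tai_tham_digit(ch):
    # Hora digits U+1A80..U+1A89 and Tham digits U+1A90..U+1A99 are Nd;
    # the unassigned code points between them are not.
    return 0x1A80 <= ord(ch) <= 0x1A99 and unicodedata.category(ch) == "Nd"


def is_potential_base(ch, digits_are_bases):
    # Tai Tham letters (general category Lo) lie in U+1A20..U+1A54.
    if 0x1A20 <= ord(ch) <= 0x1A54 and unicodedata.category(ch) == "Lo":
        return True
    return digits_are_bases and is_tai_tham_digit(ch)


def sakot_needs_dotted_circle(text, digits_are_bases):
    """True if some SAKOT in text is not directly preceded by a base."""
    for i, ch in enumerate(text):
        if ch == SAKOT:
            if i == 0 or not is_potential_base(text[i - 1], digits_are_bases):
                return True
    return False
```

For <U+1A93, U+1A60, U+1A34> the check reports a dotted circle unless
`digits_are_bases` is set.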

> Some of my changes for 'New ISC' had bad consequences.  Changing
> U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel
> resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into
> two syllables, <LOW CA, SAKOT> and <LAE>.  While the font can work
> round this, this is not good.

Occasional subscripting of ancient independent vowels has been
reported, and I think HarfBuzz should support this behaviour.

> Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to
> 'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75
> TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC,
> TONE-1>.  This seems odd;  U+0ECD LAO NIGGAHITA is classified by
> Unicode as 'bindu', yet regularly has tone marks mounted on it.  Is
> the syllable splitting here a HarfBuzz error?

The problem with anusvara (Indic syllabic category  'bindu') is that
there are two types - those that terminate the syllable (a subgroup of
Indic syllabic category type OT_SM), and those that are more matra-like
(the rare category type OT_A).  The file
hb-ot-shape-complex-indic-table.cc maps them to the category type
OT_SM, but the SEA syllable analyser is set up for category OT_A.  
At present, assignments to Indic category OT_A are done by
executable code checking the character codes, and many of the
characters in this group are in fact Vedic tone marks!
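
The difference between the two mappings can be shown with a toy
syllable pattern in Python.  The pattern below is an assumption made
for illustration, not the grammar HarfBuzz uses: C is a consonant, M a
matra, A a matra-like anusvara (OT_A), T a tone mark, S a
syllable-final modifier (OT_SM), and 'o' marks an inserted dotted
circle.

```python
import re

# Toy syllable: consonant, matras, matra-like anusvaras, tones, then at
# most one syllable-final modifier.
SYLLABLE = re.compile(r"C M* A* T* S?", re.VERBOSE)
# Marks with no consonant to attach to.
ORPHANS = re.compile(r"[MATS]+")


def syllables(cats):
    """Split a string of category letters into toy syllables."""
    out = []
    pos = 0
    while pos < len(cats):
        m = SYLLABLE.match(cats, pos)
        if m:
            out.append(m.group())
        else:
            m = ORPHANS.match(cats, pos)
            if m is None:
                raise ValueError("unknown category %r" % cats[pos])
            out.append("o" + m.group())  # 'o' = dotted circle inserted
        pos = m.end()
    return out
```

With MAI KANG mapped to S, <BA, MAI KANG, TONE-1> is the string "CST"
and splits as "CS" followed by "oT"; mapped to A, it is "CAT" and stays
a single syllable.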

I think this is an area where HarfBuzz will just have to override the
Unicode settings - the general categorisations don't help with layout
constraints.

Richard.