[HarfBuzz] Dotted Circles in Tai Tham
Behdad Esfahbod
behdad at behdad.org
Thu Feb 26 09:09:31 PST 2015
Hi Richard,
I was away for a few weeks. I'm glad you and Roozbeh got into discussion.
Working with him and Andrew is indeed the best way forward. Note that as you
observed, SEA is very liberal in what it accepts. That's simply because we
didn't know any better. We will transform SEA into a USE implementation.
What would immensely help is to gather sequences that you (and others) think
should be considered one syllable. We can then add these to Roozbeh's indic
repository as test data (with the USE grammar). That will be extremely
valuable, and I'm willing to set up the code to run the tests.
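
A minimal sketch of how such sequences might be collected before they go into a repository. The code points are the ones quoted in Richard's message below; the labels, the `U+XXXX` line format and the helper names are assumptions made for illustration, not the format used by Roozbeh's repository or by HarfBuzz's tests. Whether each sequence should count as exactly one syllable is the question under discussion, so the table only records candidates.

```python
# Sequences quoted from Richard's message; the labels are made up for this sketch.
CANDIDATE_SEQUENCES = {
    "thanon": [0x1A33, 0x1A60, 0x1A36, 0x1A7B, 0x1A6B, 0x1A41],
    "dawes": [0x1A2F, 0x1A6C, 0x1A45, 0x1A7A, 0x1A60, 0x1A48],
    "cryptogrammic-dot": [0x1A49, 0x1A7F, 0x1A63],
    "three-times": [0x1A93, 0x1A60, 0x1A34],
    "lae": [0x1A29, 0x1A60, 0x1A53],
    "mai-kang-tone": [0x1A37, 0x1A74, 0x1A75],
}


def to_text(cps):
    """Turn a list of code points into a Python string."""
    return "".join(chr(cp) for cp in cps)


def to_test_line(cps):
    """Format code points as 'U+XXXX U+XXXX ...' (a hypothetical test-file format)."""
    return " ".join(f"U+{cp:04X}" for cp in cps)


if __name__ == "__main__":
    # One line per candidate, with the label as a trailing comment.
    for label, cps in CANDIDATE_SEQUENCES.items():
        print(f"{to_test_line(cps)}  # {label}")
```

Keeping the data as code points rather than literal strings avoids any dependence on the editor's or mail client's normalisation.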
behdad
On 15-01-31 05:39 PM, Richard Wordingham wrote:
> I've been having some problems with spurious dotted circles in various
> versions of HarfBuzz, and I thought I would share before proposing a
> complete solution to Behdad. I've been looking at 3 versions of
> HarfBuzz:
>
> 'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz is
> in that version of LibreOffice. I know it's old, because
> its normalisation orders U+1A60 SAKOT before the tone
> marks. I have lookups in place to ameliorate that problem.
>
> 'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.
>
> 'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable
> Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post
> 'Indic Syllable Categories'
> http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html). These
> categories are defined in HarfBuzz by file
> hb-ot-shape-complex-indic-table.cc. I was about to formally submit my
> suggestions to the Unicode Technical Committee, but then I discovered
> that the changes would adversely affect HarfBuzz.
>
> The first problem arose with U+1A7B MAI SAM. While there
> is no problem with its uses to indicate word (or phrase) repetition by
> marking the last akshara and to indicate the merger of two 1-consonant
> vowelless consonant stacks, a dotted circle occurs in the example
> /thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA, U+1A7B MAI
> SAM, U+1A6B SIGN O, U+1A41 RA>. The problem is that MAI SAM has an ISC
> of 'other', so U+25CC is inserted before SIGN O. Making MAI SAM a
> 'dependent vowel' as I had suggested fixed this problem.
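
The MAI SAM behaviour described above can be reproduced with a toy model. This is a sketch under stated assumptions, not HarfBuzz's actual grammar: it assumes that a character of category 'other' closes the current cluster and that any dependent sign with no base before it in its cluster receives U+25CC. The function name and the category tables are hypothetical.

```python
DOTTED_CIRCLE = 0x25CC


def insert_dotted_circles(cps, category):
    """Toy model: a dependent sign with no base in its cluster gets U+25CC."""
    out = []
    have_base = False
    for cp in cps:
        cat = category.get(cp, "other")
        if cat == "consonant":
            have_base = True
        elif cat == "other":
            have_base = False  # assumption: 'other' closes the current cluster
        elif not have_base:
            # A stacker, vowel sign, etc. with nothing to attach to.
            out.append(DOTTED_CIRCLE)
            have_base = True
        out.append(cp)
    return out


# /thanon/ as given in the message.
THANON = [0x1A33, 0x1A60, 0x1A36, 0x1A7B, 0x1A6B, 0x1A41]

# Categories before the proposed change: MAI SAM is 'other'.
OLD_ISC = {
    0x1A33: "consonant",
    0x1A60: "invisible stacker",
    0x1A36: "consonant",
    0x1A7B: "other",
    0x1A6B: "dependent vowel",
    0x1A41: "consonant",
}

# Proposed change: MAI SAM becomes a 'dependent vowel'.
NEW_ISC = {**OLD_ISC, 0x1A7B: "dependent vowel"}

if __name__ == "__main__":
    for name, table in (("old", OLD_ISC), ("new", NEW_ISC)):
        result = insert_dotted_circles(THANON, table)
        print(name, " ".join(f"U+{cp:04X}" for cp in result))
```

With the old table the model places U+25CC before SIGN O; with the new table the sequence passes through unchanged.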
>
> The second problem arose with U+1A7A RA HAAM, and could also arise with
> U+1A7C KARAN. The problem is that with the influx of foreign loans
> into Thai, in Thailand there are now clusters of two consonants in which
> the *first* consonant is silent. In most cases, there is no
> way for Tai Tham to show which is silent, but when the tail of the
> second consonant rises to the hanging baseline, the placement of the
> cancellation marks tends to show which consonant is cancelled. A
> (hypothetical) example is the English surname 'Dawes', which is
> represented with three consonants in Thai. The transliteration of 'w'
> is marked as silent. Conversely, 'Howes' would be written with the
> transliteration of the 's' as silent. This prevents the font
> deciding the placement of the cancellation mark on a cluster by cluster
> basis. Following the lead of Thai, this would be written <U+1A2F DA,
> U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM, U+1A60 SAKOT, U+1A48
> HIGH SA>.
>
> LibreOffice 4.3.4 splits the cluster into three syllables, <WA, SAKOT>,
> <RA HAAM> and <HIGH SA>, and the problem is simply that the subscript
> form cannot be generated until after the syllable boundaries are
> dropped. This is simply a variant of the tone and SAKOT problem, which
> a font can work round and which newer versions eliminate.
>
> HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>,
> <RA HAAM>, <U+25CC, SAKOT, HIGH SA> because RA HAAM has an ISC of
> 'other'. New ISC marks RA HAAM as a 'pure killer'. Unfortunately,
> this does not change the misdeduced syllable structure. I think the
> analysis needs to treat the sequence 'pure killer', 'invisible stacker'
> as being within a single syllable. Is this too much to ask for?
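
The requested rule can be illustrated with a toy segmenter. This is a sketch only: the joining rules below are assumptions chosen to reproduce the splits reported above, and neither the function name nor the `killer_joins` switch exists in HarfBuzz. Dotted-circle insertion is left out, so the third syllable appears without its U+25CC.

```python
def split_syllables(cps, category, killer_joins=False):
    """Toy segmenter; killer_joins enables the proposed killer + stacker rule."""
    syllables = []
    for cp in cps:
        cat = category[cp]
        # Category of the last character of the syllable being built, if any.
        prev = category[syllables[-1][-1]] if syllables else None
        if cat == "consonant":
            joins = prev == "invisible stacker"
        elif cat == "invisible stacker":
            joins = prev == "consonant" or (killer_joins and prev == "pure killer")
        elif cat == "pure killer":
            joins = killer_joins and prev == "consonant"
        else:
            joins = False  # 'other' always stands alone
        if joins:
            syllables[-1].append(cp)
        else:
            syllables.append([cp])
    return syllables


# The cluster <WA, RA HAAM, SAKOT, HIGH SA> from the 'Dawes' example.
CLUSTER = [0x1A45, 0x1A7A, 0x1A60, 0x1A48]

CURRENT_ISC = {
    0x1A45: "consonant",
    0x1A7A: "other",
    0x1A60: "invisible stacker",
    0x1A48: "consonant",
}

# 'New ISC': RA HAAM reclassified as a 'pure killer'.
NEW_ISC = {**CURRENT_ISC, 0x1A7A: "pure killer"}

if __name__ == "__main__":
    print(split_syllables(CLUSTER, CURRENT_ISC))
    print(split_syllables(CLUSTER, NEW_ISC))
    print(split_syllables(CLUSTER, NEW_ISC, killer_joins=True))
```

In this model, reclassifying RA HAAM alone leaves the three-way split in place; only the additional joining rule yields a single syllable.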
>
> The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC
> DOT, and possibly is not a real problem. I have too few examples of
> the character's use. CRYPTOGRAMMIC DOT currently has an ISC of 'other',
> so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence <U+1A49
> HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three
> syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>. It is
> possible that the input sequence will not occur in the wild. In 'New
> ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the sequence
> is treated as a single syllable, as desired.
>
> The next problem was with the admittedly unusual writing <U+1A93 THAM
> DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'. None of the
> three versions allowed the digit to be treated as a consonant base, and
> so U+25CC was introduced before SAKOT. Does the SEA engine need to be
> specifically instructed to treat Tai Tham decimal numbers as potential
> character bases?
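
One visible difference between the digit and the letters is their Unicode General_Category, which can be checked with Python's standard `unicodedata` module. Whether the SEA engine actually keys off this property is an assumption here; the sketch only shows the property values for the three characters in the sequence.

```python
import unicodedata


def general_category(cp):
    """Return the Unicode General_Category of a code point, e.g. 'Lo' or 'Nd'."""
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    # THAM DIGIT THREE, SAKOT, LOW TA: the 'three times' sequence.
    for cp in (0x1A93, 0x1A60, 0x1A34):
        print(f"U+{cp:04X} {general_category(cp)}")
```

The digit is a decimal number (Nd) while LOW TA is a letter (Lo), so an engine that accepts only letters as bases would need an explicit exception for digits.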
>
> Some of my changes for 'New ISC' had bad consequences. Changing
> U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel
> resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into
> two syllables, <LOW CA, SAKOT> and <LAE>. While the font can work
> round this, this is not good.
>
> Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to
> 'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75
> TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC,
> TONE-1>. This seems odd; U+0ECD LAO NIGGAHITA is classified by
> Unicode as 'bindu', yet regularly has tone marks mounted on it. Is the
> syllable splitting here a HarfBuzz error?
>
> Richard.
--
behdad
http://behdad.org/