[HarfBuzz] Dotted Circles in Tai Tham

Roozbeh Pournader roozbeh at google.com
Mon Feb 23 17:59:44 PST 2015


Richard,

I am working on a new update of InSC for Unicode 8.0, which is available at
https://github.com/roozbehp/unicode-data.

After that, we'll push that into HarfBuzz.

It would be best if you suggest updates to the Unicode property instead,
including potentially subdividing a property value. In this way, users of
all implementations (including Microsoft's Universal Shaping Engine) would
benefit.

Please take a look and send me or UTC your suggestions (or file bugs at
https://github.com/roozbehp/unicode-data/issues). If there is still a need
to change something in HarfBuzz after that, we can do that too.

On Mon, Feb 23, 2015 at 4:31 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Sun, 1 Feb 2015 01:39:42 +0000
> Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
>
> > I've been having some problems with spurious dotted circles in various
> > versions of HarfBuzz, and I thought I would share before proposing a
> > complete solution to Behdad.
>
> Well, no-one has shown any interest, so I will go ahead with my
> proposals/requests.  For ease of reference, I have deleted little from
> my original post.
>
> > I've been looking at 3 versions of HarfBuzz:
>
> > 'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz
> > is in that version of LibreOffice.
>
> When checking the version later, I saw 'LibreOffice 4.3.3.2', so it's
> possible LibreOffice 4.3.4 is different.
>
> > 'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.
>
> Some time on Saturday 31 January 2015 might be more precise.
>
> > 'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable
> > Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post
> > 'Indic Syllable Categories'
> > http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html).
>
> > These categories are defined in HarfBuzz by file
> > hb-ot-shape-complex-indic-table.cc.  I was about to formally submit my
> > suggestions to the Unicode Technical Committee, but then I discovered
> > that the changes would adversely affect HarfBuzz.
>
> > The first problem arose with U+1A7B MAI SAM.  While there
> > is no problem with its uses to indicate word (or phrase) repetition by
> > marking the last akshara and to indicate the merger of two 1-consonant
> > vowelless consonant stacks, a dotted circle occurs in the example
> > /thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA, U+1A7B MAI
> > SAM, U+1A6B SIGN O, U+1A41 RA>.  The problem is that MAI SAM has an
> > ISC of 'other', so U+25CC is inserted before SIGN O.  Making MAI SAM a
> > 'dependent vowel' as I had suggested fixed this problem.
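
The effect described above can be reproduced with a toy model. This is a minimal sketch, not HarfBuzz's actual algorithm: the category names, the `CATEGORIES` table and the `insert_dotted_circles` function are all hypothetical, and the only rule modelled is "a dependent sign with no base in the current cluster gets U+25CC in front of it".

```python
DOTTED_CIRCLE = "\u25cc"

# Hypothetical, simplified categories for the characters of /thanon/.
# Only MAI SAM's 'other' value is taken from the message.
CATEGORIES = {
    "\u1a33": "consonant",  # HIGH THA
    "\u1a60": "stacker",    # SAKOT
    "\u1a36": "consonant",  # NA
    "\u1a7b": "other",      # MAI SAM
    "\u1a6b": "vowel",      # SIGN O
    "\u1a41": "consonant",  # RA
}

def insert_dotted_circles(text, categories):
    """Toy model: 'other' ends a cluster; a dependent sign needs a base."""
    out = []
    in_cluster = False  # True while a base is available for dependents
    for ch in text:
        cat = categories.get(ch, "other")
        if cat == "consonant":
            in_cluster = True
        elif cat in ("stacker", "vowel"):
            if not in_cluster:
                out.append(DOTTED_CIRCLE)  # supply a visible placeholder base
                in_cluster = True
        else:
            in_cluster = False  # 'other' terminates the cluster
        out.append(ch)
    return "".join(out)

thanon = "\u1a33\u1a60\u1a36\u1a7b\u1a6b\u1a41"
broken = insert_dotted_circles(thanon, CATEGORIES)

# Reclassifying MAI SAM as a dependent vowel keeps the cluster open.
fixed_categories = dict(CATEGORIES)
fixed_categories["\u1a7b"] = "vowel"
fixed = insert_dotted_circles(thanon, fixed_categories)
```

In this model `broken` gains a U+25CC before SIGN O, while `fixed` is identical to the input, which mirrors the behaviour reported for the two classifications.
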
>
> > The second problem arose with U+1A7A RA HAAM, and could also arise
> > with U+1A7C KARAN.  The problem is that with the influx of foreign
> > loans into Thai, in Thailand there are now clusters of two consonants
> > in which the *first* consonant cluster is silent.  In most cases,
> > there is no way for Tai Tham to show which is silent, but when the
> > tail of the second consonant rises to the hanging baseline, the
> > placement of the cancellation marks tends to show which consonant is
> > cancelled.  A (hypothetical) example is the English surname 'Dawes',
> > which is represented with three consonants in Thai.  The
> > transliteration of 'w' is marked as silent.  Conversely, 'Howes'
> > would be written with the transliteration of the 's' as silent.  This
> > prevents the font deciding the placement of the cancellation mark on
> > a cluster by cluster basis. Following the lead of Thai, this would be
> > written <U+1A2F DA, U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM,
> > U+1A60 SAKOT, U+1A48 HIGH SA>.
>
> > LibreOffice 4.3.4 splits the cluster into three syllables, <WA,
> > SAKOT>, <RA HAAM> and <HIGH SA>, and the problem is simply that the
> > subscript form cannot be generated until after the syllable boundaries are
> > dropped.  This is simply a variant of the font-soluble but for the
> > future eliminated tone and SAKOT problem.
> >
> > HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>,
> > <RA HAAM>, <U+25CC, SAKOT, HIGH SA> because RA HAAM has an ISC of
> > 'other'.  New ISC marks RA HAAM as a 'pure killer'.  Unfortunately,
> > this does not change the misdeduced syllable structure.  I think the
> > analysis needs to treat the sequence 'pure killer', 'invisible
> > stacker' as being within a single syllable.  Is this too much to ask
> > for?
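
The request can be illustrated with a regular expression over one-letter category codes. This is only a sketch under stated assumptions: the codes (C consonant, K pure killer, H invisible stacker, V dependent vowel, O other), both patterns and the `split_syllables` helper are hypothetical and far simpler than the real syllable machine.

```python
import re

# Hypothetical one-letter codes for the <WA, RA HAAM, SAKOT, HIGH SA> cluster,
# using the 'New ISC' value of 'pure killer' for RA HAAM.
CODES = {
    "\u1a45": "C",  # WA
    "\u1a7a": "K",  # RA HAAM
    "\u1a60": "H",  # SAKOT
    "\u1a48": "C",  # HIGH SA
}

# Strict: a stacker must directly follow a consonant.
# The second alternative models a broken cluster that starts with a stacker.
STRICT = re.compile(r"C(?:HC)*V*|(?:HC)+V*|.")

# Relaxed: an optional pure killer may sit between consonant and stacker.
RELAXED = re.compile(r"C(?:K?HC)*V*|(?:HC)+V*|.")

def split_syllables(text, pattern):
    """Split text wherever the pattern splits its category-code string."""
    codes = "".join(CODES.get(ch, "O") for ch in text)
    return [text[m.start():m.end()] for m in pattern.finditer(codes)]

cluster = "\u1a45\u1a7a\u1a60\u1a48"
strict_result = split_syllables(cluster, STRICT)
relaxed_result = split_syllables(cluster, RELAXED)
```

With the strict pattern the cluster falls into three pieces (WA, RA HAAM, and a baseless SAKOT plus HIGH SA); with the relaxed pattern it stays whole.
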
> >
> > The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC
> > DOT, and possibly is not a real problem.  I have too few examples of
> > the character's use.  CRYPTOGRAMMIC DOT currently has an ISC of
> > 'other', so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence
> > <U+1A49 HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three
> > syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>.  It
> > is possible that the input sequence will not occur in the wild.  In
> > 'New ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the
> > sequence is treated as a single syllable, as desired.
>
> The first (MAI SAM) and third (CRYPTOGRAMMIC DOT) problems are solved
> by recategorising the characters as interior members of the syllable,
> with Indic Syllabic categories 'matra' and 'nukta' respectively.  I
> recommend that HarfBuzz make these changes, in the file
> hb-ot-shape-complex-indic-table.cc.
>
> MAI SAM is an odd matra, as its placement is determined by its phonetic
> role, not its visual position.  It is worth noting that this mark is a
> superscript version of U+1A92 TAI THAM THAM DIGIT TWO, and that its
> core meaning is that there are two of something, not just one of
> something.  A classification as 'Consonant_medial' would work just as
> well.
>
> For the second problem, an analogue can be found in the Khmer
> sequences <U+179C KHMER LETTER VO, U+17CD KHMER SIGN TOANDAKHIAT,
> U+17D2 KHMER SIGN COENG, U+179F KHMER LETTER SA> and its anagram
> <U+179C, U+17D2, U+179F, U+17CD>, which render without complaint and
> slightly differently in both HarfBuzz and Windows 7 (other versions not
> tested).
>
> Now, at present, U+17CD is classified as 'Vowel_Dependent' by an
> explicit override in gen-indic-table.py, the generator of
> hb-ot-shape-complex-indic-table.cc.  The same treatment would suffice
> for U+1A7A RA HAAM and U+1A7C KARAN.
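
As a rough illustration of what such an override amounts to, here is a minimal sketch. It does not reproduce the real contents or structure of gen-indic-table.py: `PARSED_ISC`, `OVERRIDES` and `effective_category` are hypothetical names, and the only data taken from the message are the 'other' value for RA HAAM and the 'Vowel_Dependent' override for U+17CD.

```python
# Hypothetical excerpt of category values as parsed from the Unicode data.
PARSED_ISC = {
    0x1A7A: "Other",  # TAI THAM SIGN RA HAAM, as reported in the message
}

# Hypothetical override table, keyed by code point.
OVERRIDES = {
    0x17CD: "Vowel_Dependent",  # KHMER SIGN TOANDAKHIAT (existing, per the message)
    0x1A7A: "Vowel_Dependent",  # RA HAAM (proposed)
    0x1A7C: "Vowel_Dependent",  # KARAN (proposed)
}

def effective_category(cp, parsed=PARSED_ISC, overrides=OVERRIDES):
    """Return the override if present, else the parsed value, else 'Other'."""
    if cp in overrides:
        return overrides[cp]
    return parsed.get(cp, "Other")

with_override = effective_category(0x1A7A)
without_override = effective_category(0x1A7A, overrides={})
```

Keeping the overrides in a separate table makes it easy to drop each one again if the Unicode property itself is later updated.
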
>
> > The next problem was with the admittedly unusual writing <U+1A93 THAM
> > DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'.  None of the
> > three versions allowed the digit to be treated as a consonant base,
> > and so U+25CC was introduced before SAKOT.  Does the SEA engine need
> > to be specifically instructed to treat Tai Tham decimal numbers as
> > potential character bases?
>
> The answer, I see, is that it does need to be so instructed.
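
What "so instructed" might mean can be sketched as follows. This is a toy check, not the SEA engine's logic: `is_base` and `insert_circle_before_sakot` are hypothetical, and treating every character of general category Lo as a base is a deliberate simplification.

```python
import unicodedata

SAKOT = "\u1a60"
DOTTED_CIRCLE = "\u25cc"

def is_base(ch, digits_are_bases):
    """Toy rule: letters are bases; decimal digits only if allowed."""
    cat = unicodedata.category(ch)
    if cat == "Lo":
        return True
    return digits_are_bases and cat == "Nd"

def insert_circle_before_sakot(text, digits_are_bases):
    """Insert U+25CC before any SAKOT that does not follow a base."""
    out = []
    prev = None
    for ch in text:
        if ch == SAKOT and (prev is None or not is_base(prev, digits_are_bases)):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        prev = ch
    return "".join(out)

# <THAM DIGIT THREE, SAKOT, LOW TA> 'three times'
three_times = "\u1a93\u1a60\u1a34"
without_digits = insert_circle_before_sakot(three_times, digits_are_bases=False)
with_digits = insert_circle_before_sakot(three_times, digits_are_bases=True)
```

Only when digits are accepted as bases does the sequence pass through unchanged.
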
>
> > Some of my changes for 'New ISC' had bad consequences.  Changing
> > U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel
> > resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into
> > two syllables, <LOW CA, SAKOT> and <LAE>.  While the font can work
> > round this, this is not good.
>
> Occasional subscripting of ancient independent vowels has been
> reported, and I think HarfBuzz should support this behaviour.
>
> > Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to
> > 'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75
> > TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC,
> > TONE-1>.  This seems odd;  U+0ECD LAO NIGGAHITA is classified by
> > Unicode as 'bindu', yet regularly has tone marks mounted on it.  Is
> > the syllable splitting here a HarfBuzz error?
>
> The problem with anusvara (Indic syllabic category 'bindu') is that
> there are two types - those that terminate the syllable (a subgroup of
> Indic syllabic category type OT_SM), and those that are more matra-like
> (the rare category type OT_A).  The file
> hb-ot-shape-complex-indic-table.cc maps them to the category type
> OT_SM, but the SEA syllable analyser is set up for category OT_A.
> At present, assignments to Indic category OT_A are done by
> executable code checking the character codes, and many of the
> characters in this group are in fact Vedic tone marks!
>
> I think this is an area where HarfBuzz will just have to override the
> Unicode settings - the general categorisations don't help with layout
> constraints.
>
> Richard.
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
>