[HarfBuzz] NOW: Shaper for SEA Scripts --- WAS: Funky harfbuzz compile error

Fri Apr 29 11:21:32 PDT 2011

Hi, Martin,

On Thu, Apr 28, 2011 at 10:47 PM, Martin Hosken <mhosken at gmail.com> wrote:
> Dear Ed,
>
>> Hi, Behdad,
>
> Hmm I seem to have been locked out of the harfbuzz list. Behdad could you look into it at some point. TIA.
>
>> Now that everything is working fine, I am going to start taking a look
>> at writing shaper
>> code to handle Tai Tham and related Southeast Asian scripts (New Tai
>> Lue, Tai Viet, and Myanmar; eventually probably also adding
>> Cambodian).
>
> I don't think trying to support all of these scripts in a single shaper is a good idea.
>
> Tai Viet can be handled, like Thai, using the generic shaper

RE: Tai Viet -- Ooops, my bad!  OK, I had not read chapter 4,
"Character Properties" of the Unicode Standard Book since something
like version 3.0, and just assumed that Tai Viet, having been encoded
more recently, would follow the logical model used for Tai Tham,
Khmer, etc. rather than the visual model of Thai and Lao.  I stand
corrected -- and that's one less script to have to worry about! :-)

> New Tai Lue is the simplest script to shape given there is no OT lookup interaction required
>
> Why clog these two scripts up with having to do all the tai tham tests? I think writing
> a tai tham shaper would be great. Do you intend to add this shaper to all OT engines?

Well, please feel free to tell me so if I am being very naïve here --
but what I have been thinking of doing is writing somewhat generalized
"reordering" code that will shift the "reordrant" combining mark
characters (as listed in Table 4-4 of the Unicode Standard 6.0) to the
front of clusters (before glyph substitution occurs).

In theory --albeit possibly my own naïve theory-- this could be done
in a fairly generic fashion for a reasonable number, if not all, of
the (nineteen) Brahmic-derived scripts presented in the aforementioned
Table 4-4.  Relevant properties of the individual code points in the
string buffer would be looked up in script-specific tables.

I am considering the possibility of using bit masks to tag the table
code point entries by category.  Thus a single code point could be
tagged with multiple categories as required:

==>  U+1A6E TAI THAM VOWEL SIGN E would be "DEPENDENT_VOWEL | REORDRANT"

==> U+1A55 TAI THAM MEDIAL RA would be "CONSONANT | REORDRANT"

Naturally other bit masks will be defined as required.  For example,
we know that a syllable (cluster) may start with either a CONSONANT or
an INDEPENDENT_VOWEL.  So perhaps it would be useful to have a bit
mask category called CAN_BEGIN_CLUSTER.  That way, when looking for
the beginning of a cluster, we can just test for CAN_BEGIN_CLUSTER
instead of having to test separately for "CONSONANT or
INDEPENDENT_VOWEL".

Basically whatever individual tests have to be performed against
individual code points determines what bit-mask categories should be
defined.  In theory, this will make the code and the associated tables
easy for humans to read and understand.
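
A minimal sketch in C of what such tagged table entries might look like.
Everything here is an assumption made for illustration: the CAT_* names,
the category_entry_t type, the lookup_categories() and has_all() helpers,
and the four sample entries are hypothetical, not existing HarfBuzz code,
and a real table would cover the whole script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category bits; one code point may carry several at once. */
enum {
  CAT_CONSONANT         = 1 << 0,
  CAT_INDEPENDENT_VOWEL = 1 << 1,
  CAT_DEPENDENT_VOWEL   = 1 << 2,
  CAT_REORDRANT         = 1 << 3,
  CAT_CAN_BEGIN_CLUSTER = 1 << 4
};

typedef struct {
  uint32_t codepoint;
  unsigned categories;
} category_entry_t;

/* Illustrative sample only: four Tai Tham code points. */
static const category_entry_t tai_tham_sample[] = {
  { 0x1A20, CAT_CONSONANT | CAT_CAN_BEGIN_CLUSTER },         /* LETTER HIGH KA */
  { 0x1A4D, CAT_INDEPENDENT_VOWEL | CAT_CAN_BEGIN_CLUSTER }, /* LETTER I */
  { 0x1A55, CAT_CONSONANT | CAT_REORDRANT },                 /* MEDIAL RA */
  { 0x1A6E, CAT_DEPENDENT_VOWEL | CAT_REORDRANT }            /* VOWEL SIGN E */
};

#define TAI_THAM_SAMPLE_LEN \
  (sizeof tai_tham_sample / sizeof tai_tham_sample[0])

/* Return the category bits for cp, or 0 if cp is not in the table.
 * A linear scan keeps the sketch short; a real table would be indexed. */
static unsigned
lookup_categories (const category_entry_t *table, size_t len, uint32_t cp)
{
  assert (table != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (table[i].codepoint == cp)
      return table[i].categories;
  return 0;
}

/* True only if every bit in mask is set in cats. */
static int
has_all (unsigned cats, unsigned mask)
{
  return (cats & mask) == mask;
}
```

With this layout, finding the start of a cluster is a single test,
"cats & CAT_CAN_BEGIN_CLUSTER", and a combined test such as "reordrant
dependent vowel" is has_all (cats, CAT_DEPENDENT_VOWEL | CAT_REORDRANT).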

As I currently envision things, this "reorderer" works in two passes:

1. The first pass finds REORDRANT DEPENDENT_VOWELs and moves them to
the beginning of their respective clusters.  In the simplest case, the
reordrant dependent vowel gets swapped with just a single consonant:
this is the only case that exists for NEW TAI LUE.  Consonant clusters
in other scripts may be written with several letters, so for other
scripts we may have to shift the reordrant dependent vowel several
places.

2. The 2nd pass finds REORDRANT CONSONANTs (i.e., MEDIAL RA).  As far
as I am aware at this point in my education, a REORDRANT medial RA
only needs to swap places with a single preceding consonant character,
so this case is as simple as Jonathan's implementation for New Tai
Lue.
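
The two passes above could be sketched roughly as follows, operating on
a plain array of code points.  This is an illustration under stated
assumptions, not a proposed implementation: the CAT_* bits,
categories_of(), and the reorder_* function names are all hypothetical,
the switch statement stands in for a real script-specific table, and
cases such as subjoined consonants or several reordrant vowels in one
cluster are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
  CAT_CONSONANT         = 1 << 0,
  CAT_DEPENDENT_VOWEL   = 1 << 1,
  CAT_REORDRANT         = 1 << 2,
  CAT_CAN_BEGIN_CLUSTER = 1 << 3
};

/* Hypothetical stand-in for a script-specific table lookup. */
static unsigned
categories_of (uint32_t cp)
{
  switch (cp) {
    case 0x1A20: /* TAI THAM LETTER HIGH KA */
    case 0x1A21: /* TAI THAM LETTER HIGH KHA */
      return CAT_CONSONANT | CAT_CAN_BEGIN_CLUSTER;
    case 0x1A55: /* MEDIAL RA */
      return CAT_CONSONANT | CAT_REORDRANT;
    case 0x1A6E: /* TAI THAM VOWEL SIGN E */
      return CAT_DEPENDENT_VOWEL | CAT_REORDRANT;
    default:
      return 0;
  }
}

static int
has_all (unsigned cats, unsigned mask)
{
  return (cats & mask) == mask;
}

/* Pass 1: move each reordrant dependent vowel to the start of its cluster. */
static void
reorder_vowels (uint32_t *buf, size_t len)
{
  for (size_t i = 1; i < len; i++) {
    if (!has_all (categories_of (buf[i]), CAT_DEPENDENT_VOWEL | CAT_REORDRANT))
      continue;
    /* Scan backwards for the nearest code point that can begin a cluster. */
    size_t j = i;
    while (j > 0) {
      unsigned cats = categories_of (buf[j - 1]);
      if (cats == 0)
        break; /* unknown character: leave the vowel where it is */
      j--;
      if (cats & CAT_CAN_BEGIN_CLUSTER) {
        uint32_t vowel = buf[i];
        /* Shift buf[j..i-1] one place right, then drop the vowel in front. */
        memmove (&buf[j + 1], &buf[j], (i - j) * sizeof buf[0]);
        buf[j] = vowel;
        break;
      }
    }
  }
}

/* Pass 2: swap each reordrant consonant with the single consonant before it. */
static void
reorder_medial_consonants (uint32_t *buf, size_t len)
{
  for (size_t i = 1; i < len; i++) {
    if (has_all (categories_of (buf[i]), CAT_CONSONANT | CAT_REORDRANT)
        && (categories_of (buf[i - 1]) & CAT_CAN_BEGIN_CLUSTER)) {
      uint32_t tmp = buf[i - 1];
      buf[i - 1] = buf[i];
      buf[i] = tmp;
    }
  }
}

static void
reorder_clusters (uint32_t *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  reorder_vowels (buf, len);
  reorder_medial_consonants (buf, len);
}
```

For the logical-order input HIGH KA, MEDIAL RA, VOWEL SIGN E, the first
pass moves the vowel in front of the consonant and the second pass swaps
the medial RA with it, leaving VOWEL SIGN E, MEDIAL RA, HIGH KA.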

Note my plan at this point is only (to attempt) to write the
"reordering" code.  Beyond HarfBuzz's generic OT shaper, implementation
of other features may still be required in order to execute complete
shaping of various other Brahmic-derived scripts.  For Tai Tham, I
think the generic shaper plus reordering as described here is enough.
For other scripts, like Khmer, some additional features will be
necessary.  And for a lot of other scripts, I'm sure I don't know!

>
> Just because a shaper involves reordering doesn't mean that we only need one such shaper.
>

I would love to hash this out in a round-table discussion
with you, Jonathan Kew, Theppitak Karoonboonyanan, Danh Hong, and
everyone else who understands the details of how these Brahmic-derived
scripts work.  Should HarfBuzz really have separate shapers for Tai
Tham vs. Myanmar vs. New Tai Lue vs. Khmer?  Or can these be
reasonably combined as suggested here?  Or, going even further, can
the "reordering" code be generalized to handle all 19 Brahmic-derived
scripts shown in Table 4-4, while other script-specific aspects of
shaping are broken out elsewhere?

Best - Ed

> Yours,
> Martin
>