[HarfBuzz] Kerning in Telugu (and other similar languages)

mathog mathog at caltech.edu
Wed Aug 13 17:02:18 PDT 2014


Greetings,

Behdad Esfahbod said that this would be a good place to ask, even though 
the underlying software is Pango and not HarfBuzz.  I am trying to 
resolve a bug in Inkscape

   https://bugs.launchpad.net/inkscape/+bug/1282968

that involves this string in Telugu:

   U+0C07,U+0C02,U+0C15,U+0C4D,U+200C,U+0C38,U+0C4D,U+0C15,U+0C47,U+0C2A,U+0C4D
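
For readers who want to check the character numbering used later in this 
message, here is a minimal Python sketch that decodes the list above and 
prints each character's 1-based position, general category and Unicode 
name.  The helper names decode() and describe() are made up for this 
illustration; only the standard unicodedata module is used.

```python
import unicodedata

# The string from the bug report, exactly as listed in this message.
SPEC = "U+0C07,U+0C02,U+0C15,U+0C4D,U+200C,U+0C38,U+0C4D,U+0C15,U+0C47,U+0C2A,U+0C4D"


def decode(spec):
    """Turn a comma-separated 'U+XXXX' list into a Python string."""
    return "".join(chr(int(item[2:], 16)) for item in spec.split(","))


def describe(text):
    """Return (1-based index, code point, general category, name) per character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text, start=1)
    ]


if __name__ == "__main__":
    for row in describe(decode(SPEC)):
        print(*row, sep="\t")
```

The listing makes it easy to see which entries are letters, which are 
combining signs, and which one is the invisible format character.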

As will soon become apparent, I know next to nothing about Indic 
languages.  Counting each "stacked glyph" as one glyph, the string is 
supposed to render as 6 glyphs:

   https://bugs.launchpad.net/inkscape/+bug/1282968/+attachment/3989437/+files/correct-rendering.png

Pango breaks these 6 up into 3 logical clusters as 2:3:1.

First question:  are the colon positions the proper places to insert 
kerning spaces?

pango_shape() descriptions of most languages have the property that each 
logical cluster begins with a character with the "is_cursor_position" 
attribute set, and it is not set elsewhere in the cluster.  (In the 
European languages each logical cluster is usually one letter with 
possibly one or more accents or other similar modifiers.)  That is 
almost the case here too, except that a second character within the 
2nd logical cluster also has that bit set (character 8, U+0C15).

Behdad referred me to this document:

   http://www.w3cindia.in/Indic-req-draft/Indic-layout-requirements.html#letter-spac

which says that aksharas are supposed to move around as a block. Cursor 
positions are generally where kerning spaces are inserted.  I don't know 
how to reconcile this situation, so...

Second question:  Is the 2nd logical cluster returned by Pango something 
larger than an akshara?
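
To make the question concrete, here is a rough Python sketch of one 
possible way to group the characters into akshara-like blocks.  The rule 
is an assumption made only for illustration: a letter (general category 
Lo) starts a new block unless the block so far ends in the virama U+0C4D, 
and every other character attaches to the current block.  It is not 
Pango's or HarfBuzz's algorithm, and the function name rough_aksharas() 
is hypothetical.

```python
import unicodedata

VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA


def rough_aksharas(text):
    """Group characters into akshara-like blocks using a simplified rule.

    Assumption (not Pango's logic): a letter starts a new block unless the
    current block ends with a virama.  A ZWNJ after the virama therefore
    stops the next consonant from joining, because the block no longer
    ends with the virama itself.
    """
    blocks = []
    for ch in text:
        is_letter = unicodedata.category(ch) == "Lo"
        joins_previous = bool(blocks) and blocks[-1].endswith(VIRAMA)
        if not blocks or (is_letter and not joins_previous):
            blocks.append(ch)
        else:
            blocks[-1] += ch
    return blocks


text = "\u0C07\u0C02\u0C15\u0C4D\u200C\u0C38\u0C4D\u0C15\u0C47\u0C2A\u0C4D"
for block in rough_aksharas(text):
    print(" ".join(f"U+{ord(c):04X}" for c in block))
```

Under this simplified rule the string falls into four blocks, with 
U+0C38 U+0C4D U+0C15 U+0C47 together in one of them, so character 8 sits 
inside a block rather than at the start of one.  Whether that matches 
what an akshara really is here is exactly what I am asking.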

Third question:  in a text editor for this sort of language, are the 
cursor positions restricted to the akshara transitions, or does the 
cursor move around within an akshara, stopping at each "stacked glyph"?  
What happens now in Inkscape is that one can delete Unicode characters 
off the tail of a logical cluster, but there is no way to move around 
within it.  This text can only be entered with control codes 
(^U0C07<return>^U0C02<return> etc).

Fourth and last question:  the string includes U+200C, which is a "zero 
width non-joiner".  There is no corresponding glyph - all it does is 
prevent two adjacent characters from merging into a different kind of 
glyph.  In terms of cursor motion, if one were moving across the text 
left to right with "right arrow" presses, would one expect the cursor to 
stay in the corresponding spot for two such presses, as if moving across 
a zero width character, or would its presence not affect cursor motion?

Thank you,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

