[HarfBuzz] ICU LayoutEngine "Canned" GSUB Tables

Tue Jul 10 13:08:53 PDT 2007

Hi,

The ICU LayoutEngine uses "canned" GSUB and GDEF tables to process 
Arabic text if the font doesn't have a GSUB table that covers the Arabic 
script. These tables use Unicode code points instead of glyph IDs. 
The tables are generated by an ICU4J tool that uses the Unicode 
character properties to identify ligatures and their components.

Because of this, the initial character to glyph conversion is 1 to 1 
(each code point stands in for a glyph ID), and there needs to be a 
"real" character to glyph conversion after GSUB processing. Ligature 
substitution makes sure that the font actually contains the ligature 
presentation form before forming the ligature and multiple substitution 
makes sure that the font contains the component characters before 
performing the substitution. This is done in ICU by passing an optional 
"filter" object into GSUB processing. This object looks for the 
characters in the font's CMAP table.
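
To make the filter idea concrete, here is a rough sketch in Python. It is 
not the ICU code: the function names, the dict standing in for the CMAP 
table, and the glyph IDs are all made up for illustration, and forming 
lam-alef directly from the base letters is a simplification of what the 
Arabic shaper really does.

```python
from typing import Callable, Dict, List, Optional

Filter = Callable[[int], bool]


def make_cmap_filter(cmap: Dict[int, int]) -> Filter:
    """Hypothetical filter: accept a code point only if the cmap maps it."""
    return lambda cp: cp in cmap


def substitute_ligature(text: List[int], components: List[int],
                        ligature: int,
                        accept: Optional[Filter] = None) -> List[int]:
    """Replace each run of `components` with `ligature`.

    If a filter is given and it rejects the ligature code point, the
    text is returned unchanged.
    """
    if accept is not None and not accept(ligature):
        return list(text)
    out, i, n = [], 0, len(components)
    while i < len(text):
        if text[i:i + n] == components:
            out.append(ligature)
            i += n
        else:
            out.append(text[i])
            i += 1
    return out


def substitute_multiple(text: List[int], source: int,
                        replacement: List[int],
                        accept: Optional[Filter] = None) -> List[int]:
    """Replace `source` with `replacement` if every component is accepted."""
    if accept is not None and not all(accept(cp) for cp in replacement):
        return list(text)
    out: List[int] = []
    for cp in text:
        out.extend(replacement if cp == source else [cp])
    return out


def map_to_glyphs(code_points: List[int], cmap: Dict[int, int]) -> List[int]:
    """The "real" character to glyph step, run after GSUB (0 = missing glyph)."""
    return [cmap.get(cp, 0) for cp in code_points]


# Made-up cmap: lam, alef, and the isolated lam-alef presentation form.
LAM, ALEF, LAM_ALEF = 0x0644, 0x0627, 0xFEFB
cmap_with_lig = {LAM: 10, ALEF: 11, LAM_ALEF: 42}
cmap_without_lig = {LAM: 10, ALEF: 11}

shaped = substitute_ligature([LAM, ALEF], [LAM, ALEF], LAM_ALEF,
                             make_cmap_filter(cmap_with_lig))
glyphs = map_to_glyphs(shaped, cmap_with_lig)
```

With `cmap_without_lig` the filter rejects U+FEFB, so the text comes back 
unchanged and the final mapping yields the two separate glyphs.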

In ICU, the shaper that does this is a subclass of the OpenType Arabic 
shaper. It references the canned GSUB and GDEF tables instead of the 
tables from the font and reimplements the character to glyph and post 
processing methods.

The same GSUB and GDEF tables are used for ICU's canonical processing. 
This processing is intended to produce better display results for fonts 
that may have a limited repertoire. For example, if the input text 
contains "a" followed by umlaut, an a-umlaut character will be 
substituted if it's present in the font. Also, if the input text 
contains an a-umlaut character and the font doesn't have a glyph for it, 
it will be replaced by an "a" followed by an umlaut.
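
The effect of the canonical processing can be illustrated with a small 
Python sketch. Note that this swaps in Unicode normalization from the 
standard `unicodedata` module to get the compositions; ICU does it with 
the canned GSUB tables, so this only mimics the behavior. The 
`canonical_fallback` name and the set-based "font" are assumptions for 
the example.

```python
import unicodedata
from typing import Callable


def canonical_fallback(text: str, has_glyph: Callable[[str], bool]) -> str:
    """Compose or decompose characters depending on what the font has."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Try to compose this character with the next one.
        if i + 1 < len(text):
            composed = unicodedata.normalize("NFC", ch + text[i + 1])
            if len(composed) == 1 and has_glyph(composed):
                out.append(composed)
                i += 2
                continue
        # If the font lacks this character, try its canonical decomposition.
        if not has_glyph(ch):
            decomposed = unicodedata.normalize("NFD", ch)
            if decomposed != ch and all(has_glyph(c) for c in decomposed):
                out.append(decomposed)
                i += 1
                continue
        out.append(ch)
        i += 1
    return "".join(out)


# Two made-up fonts: one has a-umlaut, the other only has the pieces.
full_font = {"a", "\u0308", "\u00e4"}
limited_font = {"a", "\u0308"}

composed = canonical_fallback("a\u0308", full_font.__contains__)
decomposed = canonical_fallback("\u00e4", limited_font.__contains__)
```

Here `composed` is the single a-umlaut character and `decomposed` is "a" 
followed by the combining umlaut.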

I spent some time on Friday morning at the summit looking at how to 
integrate this functionality into the HarfBuzz Arabic shaper. The 
obvious thing that needs to be added is the filter. The low-level GSUB 
routines will need to take an optional filter that can be used for 
ligature substitution and multiple substitution.

I thought that maybe the canned tables could be made available by 
hacking the code that looks up the tables to just return the canned ones 
if the font doesn't have a "real" one. This won't really work though. For 
one thing, we should use the canned tables if the font contains a GSUB 
table that doesn't cover the Arabic script. Also, the caller needs to 
know whether the canned tables were substituted so that it can pass in 
the correct filter and do the right character to glyph mapping. I'll 
have to spend 
some more time studying this.
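
One possible shape for this, sketched in Python rather than in HarfBuzz 
terms, is to have the table lookup return an explicit record of what it 
chose instead of silently swapping tables. Everything here is 
hypothetical: `TableChoice`, `choose_tables`, and representing the font's 
GSUB as a set of script tags (or None if there is no GSUB table).

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class TableChoice:
    source: str          # "font" or "canned"
    use_filter: bool     # pass the cmap filter into GSUB processing?
    identity_map: bool   # use the 1 to 1 character to glyph mapping first?


def choose_tables(font_gsub_scripts: Optional[Set[str]],
                  script_tag: str = "arab") -> TableChoice:
    """Pick the font's tables only if its GSUB covers the script."""
    covers = (font_gsub_scripts is not None
              and script_tag in font_gsub_scripts)
    if covers:
        return TableChoice(source="font", use_filter=False,
                           identity_map=False)
    # No GSUB at all, or a GSUB that doesn't cover the script.
    return TableChoice(source="canned", use_filter=True, identity_map=True)


no_gsub = choose_tables(None)
latin_only = choose_tables({"latn"})
arabic_font = choose_tables({"arab", "latn"})
```

The caller can then look at `use_filter` and `identity_map` to decide 
which filter to pass in and which character to glyph mapping to run.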

Regards,
Eric Mader
IBM GCoC - ICU Team