[HarfBuzz] On fallback shaping and future directions
behdad at behdad.org
Mon Nov 22 10:52:57 PST 2010
[Warning: Loooooong email ahead!]
Now that we have the Arabic shaper in place and working, I spent the past
couple of weeks pushing the edge cases and in general trying to figure out
what loose ends there are that I need to fix before moving on to making a
plan for other shapers. I then looked into fallback shaping in general, as
would be applied to Latin and other simple scripts as well as to Arabic,
Hebrew, etc.
Here is what I found. Feel free to skip the Arabic section and jump down to
the Latin one. Read on.
The goal here is to at least match what Pango has been doing, and possibly
improve on it. Compared to what was in Pango, there are a few things missing
in the current hb code:
- Missing GDEF: Many Arabic fonts in circulation do not have a GDEF table.
Without a GDEF table, ligatures cannot be formed if combining marks appear
between two bases. The OpenType spec even hints that if GDEF is missing, the
clients ought to synthesize one. Previously harfbuzz had API to set glyph
classes, which Pango was using to build them using Unicode general category.
I removed that API since it's cumbersome to use, and will instead make
hb_shape() simply fall back to using the general category if GDEF is
missing. I'm just a few lines away from having that working already.
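A minimal sketch of that fallback, in illustrative Python rather than the actual hb code, assuming a 1:1 mapping from characters to glyphs and hypothetical constants standing in for the GDEF glyph class values:

```python
import unicodedata

# Hypothetical constants mirroring the GDEF glyph class values.
BASE, LIGATURE, MARK, COMPONENT = 1, 2, 3, 4

def synthesize_glyph_class(ch):
    """Fall back to the Unicode general category when GDEF is missing:
    anything in the M* categories (Mn/Mc/Me) is treated as a mark, so
    ligature lookups can skip it when matching base glyphs."""
    if unicodedata.category(ch).startswith('M'):
        return MARK
    return BASE
```

With the combining marks classified this way, a lam-alef or Allah ligature can still form even when marks sit between the bases.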
- Missing GPOS: Many Arabic fonts, especially those from Microsoft, do not
have a GPOS table. This could all be just fine, except that many of them do
not have zero advance width for the marks either. So, circumvention is
necessary. Pango used to have API to simply zero advance width of marks and
the Arabic module simply used that API. Note that it's in general not safe to
zero advance width of all combining marks as there are legitimate cases for
marks with positive advance width. I'll cover this later in the non-Arabic
discussion, but worst case, we can add a post-positioning hook and use it in
the Arabic module to zero mark advances.
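Such a post-positioning hook could be as simple as this sketch, assuming for illustration a 1:1 character-to-glyph mapping and a plain list of advances:

```python
import unicodedata

def zero_mark_advances(text, advances):
    """Fallback for fonts with no GPOS and non-zero mark widths: zero
    the advance of every glyph whose character is a combining mark."""
    return [0 if unicodedata.category(ch).startswith('M') else adv
            for ch, adv in zip(text, advances)]
```

For RTL Arabic, zeroing the advance alone is enough, since the mark then stays over the base that precedes it in logical order.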
- Missing GSUB: Such fonts can be handled by falling back to the
presentation forms encoded in Unicode, but since Pango never did that, I
wouldn't consider it a high priority, or even something that we should
support.
- Feature apply order: The order that feature lookups should be applied in
OpenType has been up for debate for quite a while. Consensus seems to be
forming around the following:
* For non-complex scripts, add in all the features and apply by lookup index.
* For complex scripts:
o Apply ccmp and a select few other features (ltrm, ltra, rtlm, rtla?),
apply by lookup index.
o Apply any complex-shaping features.
o Apply everything else, by lookup index order.
This is understood to be what Uniscribe does, and is almost what the spec
says. Jonathan and I, however, have been thinking about applying all features in
one round for Arabic since the complex-shaping part is really simple for
Arabic. That's indeed what the current hb code does, but I noticed that with
this setting, IranNastaliq, one of the most OpenType-heavy fonts I use for
testing, fails to form the Allah ligature with hb while it was working just
fine with Pango. Pango used to apply features one at a time. Checking the
font, indeed I see that the ccmp lookups are at the end of the lookup list
while Uniscribe applies them early in the shaping process. So unfortunately I
have to move to breaking up the shaping around complex features for Arabic. I
wish I didn't have to. But since that's what we have to do for Indic anyway,
I don't mind putting the infrastructure in there.
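The staged apply order above can be sketched as follows; the feature-to-lookup mapping here is hypothetical toy data, not something parsed from a real GSUB table:

```python
# Features applied early, by lookup index, before complex shaping.
EARLY = {'ccmp', 'ltrm', 'ltra', 'rtlm', 'rtla'}

def apply_order(feature_lookups, complex_features):
    """feature_lookups: dict of feature tag -> list of lookup indices.
    Returns lookup indices in staged order: early features first, then
    complex-shaping features, then everything else by lookup index."""
    def lookups(tags):
        return sorted({i for t in tags for i in feature_lookups.get(t, [])})
    early = lookups(EARLY & feature_lookups.keys())
    complex_ = lookups(set(complex_features))
    rest = lookups(feature_lookups.keys() - EARLY - set(complex_features))
    return early + complex_ + rest
```

This staging is what would pull IranNastaliq's ccmp lookups to the front of the process even though they sit at the end of the font's lookup list.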
- ZWNJ/ZWJ, etc.: Pango also used to remove these characters from the glyph
stream; not actually removing them, but replacing them with "empty"
glyphs. That's ok for Arabic, but in general we should handle those two
special characters much better. In particular, the OpenType engine should
simply ignore ZWJ when forming ligatures. In fact, a wishful reading of the
Unicode and OpenType specs suggests that maybe we should turn the 'dlig'
feature on for characters before ZWJ. How does that sound? Regardless, we have to
get to the Indic shaper to see what exactly we have to do with these two.
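As an illustration of the desired ZWJ behavior, here is a toy ligature matcher over a character sequence; this is a sketch of the idea only, not the OpenType lookup machinery:

```python
ZWJ, ZWNJ = '\u200d', '\u200c'

def match_ligature(seq, pos, components):
    """Try to match a ligature's components starting at pos, ignoring
    ZWJ (which should encourage, never block, ligation). Returns the
    matched indices, or None; ZWNJ simply fails the match."""
    matched, i = [], pos
    for comp in components:
        while i < len(seq) and seq[i] == ZWJ:
            i += 1  # skip ZWJ when forming ligatures
        if i >= len(seq) or seq[i] != comp:
            return None  # ZWNJ (or anything else) breaks the match
        matched.append(i)
        i += 1
    return matched
```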
So much for Arabic.
We didn't do any fallback shaping in the non-complex renderer in Pango, so I
was happy to launch without any in harfbuzz. However, the prevalence of
webfonts in the target audience of harfbuzz changed this recently.
In particular, the following Firefox bug shows that the widely available
Georgia font is extremely broken, almost beyond repair. I'll quote Jonathan
Kew's comment #16 from the bug:
"This occurs because of a number of errors and deficiencies in the Georgia
font - at least in version 5.00, which I have on both Windows and Mac systems
here. It lacks GPOS support for "mark positioning", to properly place
combining marks over base characters; the glyphs for combining mark characters
(U+03xx) have non-zero widths; and in addition, the GDEF table incorrectly
classifies the glyphs for the combining marks as "Base Glyph" rather than
"Mark Glyph".
It seems that DirectWrite hacks around this error in *some* cases by
automagically using the precomposed character in place of base+diacritic
combinations (e.g. it renders <e, combining caron> using the single <ecaron>
glyph); however, it doesn't do this consistently for all such combinations, as
can be seen with the <u, combining ring above> sequence seen in the word
"svůj" and several other instances in the example.
The behavior of other font-shaping systems varies a bit: in Safari, I see
results similar to DirectWrite (the e-caron combination is handled, but u-ring
renders with the ring mispositioned); in Minefield with harfbuzz disabled
(i.e. using our Core Text path) even the u-ring is handled nicely. The
difference here seems to be that the combining ring (U+030A) is entirely
absent from Georgia, whereas the combining caron (U+030C) is present but
improperly implemented. ...... Aha, that'll be why DW doesn't handle <u,
combining ring above> nicely: the ring is missing from the font and so it
falls back to a different font, and then it fails to combine the base+mark.
Likewise in Safari. Our Core Text path, on the other hand, manages to use the
precomposed <u-ring> character even though the ring alone was not supported
in the font."
I've been thinking about how to *fix* that and similar cases. Here are a
few different ideas I've come up with so far:
- Incorrect GDEF: For sure, GDEF is an absolutely necessary table for any
well-crafted OpenType font, not least because of the mark attachment
classes and mark glyph sets. But the regular glyph classes are of much less
importance than they might initially seem. In particular, we only care about
mark vs non-mark classes. I wonder if we should use Unicode general category
to *adjust* GDEF mark classes. That is, always mark a glyph as a mark class
if general category suggests so, even if the GDEF class says non-mark.
I can run a check over my library of fonts to detect all the cases where
such a heuristic would differ from what GDEF says, and inspect those fonts
to see whether using the heuristic would regress or improve things. A
better heuristic may be to only do this if the GDEF table doesn't have a
mark attachment class table or mark filtering sets, suggesting that it's a
poorly-built font.
Note that the case of completely missing GDEF glyph classes will be handled as
described already in the Arabic section above.
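The adjustment heuristic might look like this sketch; the glyph class constant is hypothetical, and real code would operate on glyphs rather than characters:

```python
import unicodedata

MARK = 3  # hypothetical GDEF mark class value

def adjust_gdef_class(gdef_class, ch):
    """Trust Unicode over GDEF for markness: if the general category
    says the character is a mark, force the mark class even when the
    font's GDEF classified its glyph as a base."""
    if unicodedata.category(ch).startswith('M'):
        return MARK
    return gdef_class
```

For Georgia's combining caron, this would turn the incorrect "Base Glyph" classification back into a mark.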
- Missing / incomplete 'ccmp': Technically it's the job of the 'ccmp'
feature to compose / decompose the combining marks with / from their bases to achieve
the best possible rendering the font can offer. In reality, however, most
fonts simply don't do a good job of that. To do better, many shaping engines
compose / decompose the character stream, and I want to do the same in HarfBuzz.
Years ago Eric Mader noted that in ICU Layout he does composition by using the
GSUB machinery on a "canned" OpenType table. By "canned" table we mean a blob
of GSUB tables that have been pregenerated from Unicode and other data and
included as part of the shaper. These canned tables work on the character
data, not glyph data, and hence are font-independent. They have to be used
before the 'cmap' mapping is applied. The downside however is that for them
to be useful, the GSUB machinery needs to be adjusted to make a get_glyph()
call before making any substitution. So, this is a nice trick to be aware of,
but not necessarily the best solution to every substitution problem.
Here is the plan of attack I'm comfortable implementing: I'll add two Unicode
funcs, one for composing and another for decomposing. One can do full Unicode
NFC and NFD normalization by recursively applying 2-to-1 / 1-to-2 character
mappings. With those two calls, and given the Unicode combining classes
that we already have, in our 'cmap' loop we can try decomposing a character
if it's not in the 'cmap', or try composing a mark with its base.
The composing part should perhaps be tried even if the font supports the
mark, the reason being that if the font supports a precomposed form, it's
almost always superior to the decomposed rendering. In ideal cases, with
high-quality fonts, the two renderings will be the same. But for
low-quality fonts with no proper mark positioning, the precomposed forms
are always higher quality.
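A sketch of that 'cmap' loop, with a toy dict standing in for the font's cmap; note that the compose direction is preferred even when the mark itself is covered:

```python
import unicodedata

def map_pair(cmap, base, mark):
    """Prefer the precomposed glyph even if the mark is supported,
    since precomposed forms almost always render better."""
    composed = unicodedata.normalize('NFC', base + mark)
    if len(composed) == 1 and composed in cmap:
        return [cmap[composed]]
    if base in cmap and mark in cmap:
        return [cmap[base], cmap[mark]]
    return None  # .notdef; leave for font fallback

def map_char(cmap, ch):
    """If a character isn't in the cmap, try its decomposition."""
    if ch in cmap:
        return [cmap[ch]]
    decomposed = unicodedata.normalize('NFD', ch)
    if len(decomposed) > 1 and all(c in cmap for c in decomposed):
        return [cmap[c] for c in decomposed]
    return None
```

This is exactly the Georgia scenario: <u, combining ring above>, with the ring absent from the font, composes to the precomposed <u-ring>, which the font does have.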
In implementing the above I'm tempted to implement the Hangul Jamo
composition/decomposition programmatically in harfbuzz itself, so the
callbacks just need to handle the non-Jamo pairs. I do know that pair
composition/decomposition is not something that most Unicode Character
Database implementations expose. Most Unicode libraries only export functions
to normalize full strings. I'll see how that goes. Oh, and with this stuff
in place, we wouldn't need a separate Hangul module anymore.
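Hangul syllable composition is fully algorithmic; the constants below come from the Unicode conjoining jamo algorithm, so a sketch needs no table data at all:

```python
# Constants from the Unicode Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_jamo(l, v, t=None):
    """Compose leading, vowel, and optional trailing jamo code points
    into a precomposed Hangul syllable, or return None if they are
    not composable conjoining jamo."""
    l_idx, v_idx = l - L_BASE, v - V_BASE
    t_idx = (t - T_BASE) if t is not None else 0
    if not (0 <= l_idx < L_COUNT and 0 <= v_idx < V_COUNT
            and 0 <= t_idx < T_COUNT):
        return None
    return chr(S_BASE + (l_idx * V_COUNT + v_idx) * T_COUNT + t_idx)
```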
When we compose/decompose in harfbuzz, it becomes harder for the
higher-level layers to itemize text based on font coverage, since they
would then also have to try the same compose/decompose opportunities before
using fallback fonts. In the
future we may add itemization helpers in harfbuzz itself such that the same
logic can be reused by the itemizer.
While we are at composing/decomposing, we can also stop and think about NFKD.
I mean, it would be nice if, when you try to render U+2474 PARENTHESIZED
DIGIT ONE and your font doesn't support it, you got "(1)" instead. If
there's a place to handle this, it's in the shaper. Maybe I'll add a
callback for NFKD
decomposition and only bother if the client has implemented that callback.
Not so sure about the signature for the callback though.
So much for 'ccmp' enhancements.
- Positive mark advance width: This is kinda similar to the Arabic case,
except that because of the rendering direction (rtl), for Arabic it's enough
to set the mark advance width to zero, but in Latin, we need to also make sure
that the mark resides on the left side of the origin, not the right side,
after we zero the advance.
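Sketched, with positions as mutable [advance, x_offset] pairs and an assumed 1:1 character-to-glyph mapping:

```python
import unicodedata

def fixup_mark_positions_ltr(text, positions):
    """For LTR scripts, zeroing a mark's advance is not enough: the
    mark must also be pulled back so it sits over the preceding base,
    i.e. to the left of the origin rather than the right."""
    for ch, pos in zip(text, positions):
        if unicodedata.category(ch).startswith('M'):
            pos[1] -= pos[0]  # shift the mark left of the origin
            pos[0] = 0        # then zero its advance
    return positions
```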
Or we can take a completely different approach: use 'canned' tables to do a
full 'mark' and 'mkmk' fallback implementation:
In this solution, we first populate GDEF's mark attachment type by the Unicode
combining class of the characters, then use a pre-generated GPOS table to
attach marks to bases based on their combining class type, and glyph extents.
To do the attachment we use a special get_contour_point callback that would
use the original font's get_glyph_extents to return one of a few
predetermined attachment points for the glyph (think "top-right", "bottom", etc.).
Using the same technique we can also do 'mkmk' positioning. We can even do
this if the font has a GPOS table but not 'mkmk' (and no mark attachment
classes for that matter).
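The get_contour_point trick might look like this sketch, using a simplified extents convention (x_bearing, y_bearing, width, height, with y growing upward) and hypothetical point names:

```python
def attachment_point(extents, where):
    """Derive a named attachment point ("top-right", "bottom", ...)
    from glyph extents instead of real contour points."""
    x0, y0, w, h = extents
    xs = {'left': x0, 'center': x0 + w // 2, 'right': x0 + w}
    ys = {'bottom': y0, 'top': y0 + h}
    hx, hy = where
    return xs[hx], ys[hy]
```

A canned GPOS MarkBase lookup could then reference these synthetic point indices for both the base and the mark.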
One problem still remains though: mark advance widths are still not set to
zero using the above stuff. Previously we used to zero the mark advance width
in MarkBase and similar lookups, but stopped doing that since it broke some
well-designed fonts (DejaVu Sans Mono for example). I'm not sure how to zero
the advance width short of adding fallback code to do that, but then I'm not
sure which mark classes should get that treatment.
If we do the above, the heuristics in the current fallback Hebrew shaper in
Pango can also be implemented as canned GPOS and obviate the need for a
separate Hebrew shaper.
I'm going to try to come up with a scheme that allows us to compose GPOS
tables in C code without too much hassle. If that fails, I may fall back to
using XML source to feed to ttx or something. Not sure.
So, that's it for now. Lots of stuff to juggle. Comments? I'll keep
working on implementing this stuff over the next few weeks.
There's also some off-list discussion going on regarding various East and
South Asian scripts. Still, it would be nice if someone stood up and took
the lead on those.
Also coming soon is a document about the design of the core shaper,
hopefully helping people add new complex shapers.