[HarfBuzz] On optimizing HarfBuzz
Behdad Esfahbod
behdad at behdad.org
Sat Jun 9 17:43:07 PDT 2012
Hi,
I happened to look into optimization opportunities in HarfBuzz and these are
my findings:
For very short strings, preparing the shape plan seems to take a significant
portion of the shape() call, so transparently caching the plan, as Jonathan and
I agreed we should do, would address that. On the plus side, mutexes (which
would be required for that) seem to have negligible overhead.
For longer text, it's a different story. Seems like, contrary to my
expectation, all the passes we do in hb-ot-shape.cc (getting Unicode
properties, mirroring, cmap mapping, normalization, default positioning, etc)
have really insignificant overhead. This is great.
What *is* a big problem is the actual OpenType lookups themselves. And how
bad that is depends directly on the font. There are fonts with zero OpenType
lookups (eg. Droid Sans), and there are fonts that have over 100 sublookups
for a simple shape plan (eg. Arabic Typesetting). Seems like, currently we
are being very inefficient for those larger fonts. But even for simpler fonts
with around 20 sublookups (eg. DejaVu Sans) we are doing an awful job. I
tried to understand this, and here are my findings. I think I have a good
grasp on what's causing the slowness. Less so on how to address it, but I
have ideas. Note that Uniscribe doesn't seem to be nearly as affected by the
number of lookups, so there ought to be ways to address this.
To understand this, let's see what the main apply() loops for GSUB and GPOS
currently look like:
1  for each lookup {
2    for each glyph in buffer {
3      for each sublookup of the lookup {
4        switch on the lookup_type {
5          switch on the sublookup_type {
6            if glyph in coverage of the sublookup {
7              do lots of work...
             }
           }
         }
       }
     }
   }
Now, *that*'s expensive! Note that the passthrough rate at line 6 is very, very
low. I.e. most glyphs don't pass the coverage check for most sublookups. But
we do a hell of a lot of work to reach that conclusion. In short, I think we
don't have to worry about speeding up the actual lookup processing. We need
to reach the "first glyph doesn't match" conclusion much, much faster.
The actual coverage matching is constrained by the OpenType data structures
(unless we want to do caching, etc.; more on that below). However, we can, for
example, get rid of the switching on lines 4 and 5 for most cases. The
way OpenType was originally designed, all sublookup types of all lookup types
had the coverage table offset in the same place. So we can move line 6 to
before line 4. Now, newer additions to the spec (Context subtype 3,
ChainContext subtype 3, and Extension subtype 1) don't have the coverage in
the same place. However, by rewriting the loop at line 2 to handle those three
cases separately we can avoid a lot of churning. I did an initial take of this
by adding a "fast path" to GSUB/GPOS, but didn't reorder the actual loops. Will
be doing that soon. Also, by simply looking up the coverage table before
looping over each glyph, we save a lot.
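To illustrate the reordering (this is not actual GSUB/GPOS code; coverage_t,
subtable_t and apply_lookup are made-up stand-ins for the OpenType
structures), the control flow would become something like:

#include <set>
#include <vector>

struct coverage_t
{
  std::set<unsigned int> glyphs;                  /* stand-in for a Coverage table */
  bool covers (unsigned int gid) const { return glyphs.count (gid) != 0; }
};

struct subtable_t
{
  const coverage_t *coverage;                     /* resolved once, before the glyph loop */
  unsigned int type;                              /* lookup/subtable type */

  void apply (unsigned int gid) const
  {
    (void) gid;
    switch (type) { default: break; }             /* old lines 4-5: dispatch, then the real work */
  }
};

static void
apply_lookup (const std::vector<subtable_t> &subtables,
              const std::vector<unsigned int> &buffer /* glyph ids */)
{
  for (unsigned int gid : buffer)
    for (const subtable_t &st : subtables)
    {
      /* Old line 6 hoisted above the switches of old lines 4 and 5:
       * most glyphs fail here and never pay for the dispatch. */
      if (!st.coverage->covers (gid))
        continue;
      st.apply (gid);
    }
}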
For the fonts that were slow, I found that they have lots of ChainContext
subtype 3 subtables. I understand why this is. Unlike the rest of the lookups,
each Context
subtype 3 and ChainContext subtype 3 can only match a single sequence. So, if
you have 100 different contextual ligatures, you need 100 subtables... Now,
inspecting Arabic Typesetting, I also found that all the subtables in each
lookup reference the same Coverage table for the first character. (BTW, I
have a vague memory that OTS does not like this.) Currently, we ignore that
and check each glyph against each subtable. We can easily check whether the
Coverage table is the same and skip rechecking.
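A sketch of that shortcut, reusing the made-up types from the previous
snippet: if consecutive subtables point at the same Coverage table, reuse the
verdict of the first check instead of re-walking the table.

static void
apply_lookup_shared_coverage (const std::vector<subtable_t> &subtables,
                              const std::vector<unsigned int> &buffer)
{
  for (unsigned int gid : buffer)
  {
    const coverage_t *last_coverage = nullptr;
    bool last_covered = false;

    for (const subtable_t &st : subtables)
    {
      /* Same Coverage as the previous subtable (as in Arabic Typesetting's
       * ChainContext subtype 3 lookups): reuse the previous verdict. */
      bool covered = (st.coverage == last_coverage)
                     ? last_covered
                     : st.coverage->covers (gid);
      last_coverage = st.coverage;
      last_covered  = covered;

      if (covered)
        st.apply (gid);
    }
  }
}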
This much is the easy part. To get significantly faster, we need a way to
avoid the Cartesian product of "all glyphs in the buffer" and "all subtables"
somehow. That's where caching comes in, and as one wise programmer once said,
"All programming is an exercise in caching." I have ideas there too, but I
will harvest the low-hanging fruit before going there.
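Just to make the direction concrete (not something I've decided on;
lookup_filter_t below is a made-up name): one shape such a cache could take is
a cheap, over-approximate per-lookup filter of the glyphs its Coverage tables
could ever match, built once per shape plan, so a whole lookup can be skipped
when the buffer contains none of them. False positives are harmless; false
negatives are not.

#include <bitset>
#include <vector>

struct lookup_filter_t
{
  std::bitset<4096> maybe;                        /* glyph id modulo 4096 */

  void add (unsigned int gid) { maybe.set (gid & 4095u); }        /* fed from every Coverage glyph */
  bool may_have (unsigned int gid) const { return maybe.test (gid & 4095u); }
};

/* Consulted once per (buffer, lookup) pair before touching any subtable. */
static bool
lookup_may_apply (const lookup_filter_t &filter,
                  const std::vector<unsigned int> &buffer)
{
  for (unsigned int gid : buffer)
    if (filter.may_have (gid))
      return true;                                /* some glyph might match */
  return false;                                   /* skip the entire lookup */
}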
Hopefully the easy stuff gives us enough boost to let me go back and
concentrate on correctness before we need to optimize again.
Comments?
Cheers,
behdad