[HarfBuzz] harfbuzz review

Thu Apr 15 23:19:10 PDT 2010

Behdad Esfahbod wrote:

> >> You seem to drastically under-estimate the headaches of
> >> lifecycle-management in a library.  Mozilla also disables cairo's
> >> mutex use.  So I don't think Mozilla is the perfect use case of
> >> those features.  When you are dealing with FT_Face and hb_face_t
> >> objects passed around between cairo, pango, poppler, Python
> >> wrappers, etc, the resource management API is absolutely
> >> necessary IMO.
> >
> > I'm not underestimating it, I'm simply arguing that the code for
> > doing that resource management does not belong in Harfbuzz.
> 
> "resource management" != object lifecycle management.

I'm not arguing that resource management and/or object lifecycle
management aren't needed, I'm simply saying that code for doing these
functions belongs outside the core shaping code and is better-handled
via an API that apps can tune to fit their needs.

> Allowing the user code to cache the sanitized tables is planned,
> just not provided right now.

Ok, great.

> You ignored part of my reply that I said I have plans to add API to
> decouple sanitizing from layout.  In my design, when that API is
> added, you get the functionality you are asking for.  Remains your
> request that the sanitize code is put in a separate module.  I don't
> see any benefit in doing that.

Hmm, if by decoupling you mean that it's possible to provide the
shaping routines with pre-sanitized font tables or some other
sanitized intermediate form, that's great. Then apps can use a
harfbuzz utility routine to sanitize font tables when necessary.

> > App usage patterns vary.  In the case of GSUB/GPOS tables, the
> > sanitized table structure itself may not be the ideal structure to
> > cache because untangling all the interrelated intra-table and
> > inter-table references requires still more processing before use.
> 
> What kind of more processing?  Are you suggesting that I should
> parse GSUB/GPOS tables into an internal structure and serialize into
> a different format?  No thank you.

No, I'm not suggesting a different serialization format, I'm saying
that instead of caching the full GSUB/GPOS tables it might be better
to cache some data structure that's been distilled for a given purpose.

For a lot of documents, hb_shape will be called repeatedly with the
same font/script/lang combination so it would make sense to cache the
lookups for that font/script/lang combination rather than
reconstructing it on each call to hb_shape.  Basically the lookup
arrays produced by calling setup_lookups in _hb_ot_substitute_complex
and _hb_ot_position_complex.

> >> Going from tags to strings and vice versa is at most a byteswap.
> >
> > Below is the code in hb-ot-tag.c that converts between OpenType
> > language tags in string form to 4-byte tag form.  It's a little
> > more than a byteswap, note the bsearch call.
> 
> Wrong.  The expensive part is converting from RFC-3066 / BCP-49
> language tags to OpenType tags.  If I drop that code, Firefox,
> Pango, and every other user of harfbuzz has to do that non-trivial
> conversion.  Where do you think '<html lang="fa">' is converted to
> 'FAR '?  I don't see why that code doesn't belong into harfbuzz when
> the HB API is designed for humans who don't necessarily know about
> OpenType.

I'm sure you're right, converting from RFC-3066 / BCP-49 language tags
is more expensive in comparison.  But by using strings to identify
language, script and feature tags you're forcing the conversion from
these strings to OpenType tags to occur *every* time a buffer is
passed to hb_shape.  For a given document using a single language that
conversion really only needs to happen once for all buffers passed to
hb_shape.  Ditto for script and feature tags.

I think it's fine to have a utility API that does the conversion from
a string-based language tag to 4-byte tags, equally usable by humans
or four-legged creatures.

> >> It's definitely more complex than the code it's replacing (which was
> >> plain C). However, it's very carefully designed, easier to audit,
> >> and I'm much more confident in it than in the code it replaces.
> >
> > Complex code for dealing with complex data is one thing but the
> > Harfbuzz code makes even simple things complicated in some cases. Take
> > as an example some of the struct's defined in hb-open-type-private.hh,
> > there's a whole set of *structs* defined for basic data types.  There
> > are far, far simpler ways of dealing with endian data than this.  What
> > advantage does this complexity add?  Does it somehow simplify the code?
> 
> I'm all ears to hear about those "far, far simpler ways of dealing with endian
> data".  What this code does is to obviate the need to call a bytecode
> conversion function/macro/... in thousands of places, each of which could be
> omitted by mistake.  So, review the ten lines and you're confident that there
> is no endian bugs in that 5kloc.

I agree, using structs for endian swaps is a good thing but you're
also "sanitizing" int's there!  Unfolding all the macros there it
looks like it's checking to see that the int field is within a given
bound; couldn't this be inferred with a bounds check on the containing
struct?

A typical endian-swapping struct might look something like this (this
is an aligned version but an unaligned version isn't that different):

  #define SWAP32(a)   /* swaps if needed depending on machine env */

  struct ULONG {
    ULONG(unsigned u) : v(SWAP32(u)) { }
    operator unsigned() const { return SWAP32(v); }
    unsigned v;
  };

Used in a TrueType table directory:

  struct TableDirectory {
    ULONG  tag;
    ULONG  checkSum;
    ULONG  offset;
    ULONG  length;
  }

> Other than endianness, this chunk of code also defines int types
> that have no alignment requirements.

But table data in TrueType fonts is typically 4-byte aligned.  Is
there a situation where long/short data fields within a TrueType table
are odd-byte aligned?  Or you just want the code to handle the rare
case where a poorly made font has non-4-byte aligned tables?

Cheers,

John