[HarfBuzz] harfbuzz review

Wed Apr 14 23:16:13 PDT 2010

As part of ongoing work at Mozilla I was asked to review the new
harfbuzz-ng code for inclusion in Firefox. I realize there are other
projects with differing constraints involved but I'm concerned about
the structure of the current code. Given the relative complexity of
OpenType layout and the structure of the GSUB/GPOS/GDEF tables,
OpenType shaping code is bound to be somewhat complex.  But I think
the structure of Harfbuzz code currently is much, much too complex.

One part of this complexity is the mishmash of language conventions
used, the API is C but the code is a limited subset of C++.  On the
mailing list, there seems to be confusion about this:

http://lists.freedesktop.org/archives/harfbuzz/2009-November/000460.html

> On Fri Nov 6 07:14:27 PST 2009, Adam Langley agl at google.com wrote:
> On Thu, Nov 5, 2009 at 11:46 PM, Martin Hosken <martin_hosken at sil.org> wrote:
>> I'm about to try adding graphite support as a contrib to
>> harfbuzz-ng. Graphite itself is in C++ and so the linking code is
>> going to have to bridge between C and C++. But I notice that you go
>> to great lengths to ensure that harfbuzz doesn't link to libstdc++.
>> I am wondering why this is the case.
> 
> Harfbuzz is a C library. It would be very sad if it needed to pull in
> libstdc++. That would pollute Pango, and then the rest of the
> GNOME/GTK world.
> 
> AGL

Despite this, some of the code in Harfbuzz is clearly using C++ but in
a way so as to avoid pulling in libstd++.  There's a big mix of
macro-ized code, template use and classes-as-structs coding.

The code for sanitizing OpenType tables is a good example of this
(hb-open-type-private.hh). It uses a combination of templatized
structs and macros to implement a generalized way of assuring that
font table data is copasetic.  This is code required to prevent things
like potential buffer overflow bugs.  But the way it's written it's
difficult to confirm easily what is being checked and how well.

There are also parts of the API that involve resource management that
I think really don't belong there.  For example, font tables are
handled as arbitrary binary objects, hb_blob's:

  struct _hb_blob_t {
    hb_reference_count_t ref_count;

    unsigned int length;

    hb_mutex_t lock;
    /* the rest are protected by lock */

    unsigned int lock_count;
    hb_memory_mode_t mode;

    const char *data;

    hb_destroy_func_t destroy;
    void *user_data;
  };

There's a ref count, a lock and a memory mode.  This seems like
complete overkill for a shaper.  Part of the reason for the lock and
memory mode is the way font tables are sanitized, the Harfbuzz code
tries to clean up font tables directly, either on the original table
data or by making a copy of the data.  

The table sanitizing code is currently embedded within the layout object:

void
_hb_ot_layout_init (hb_face_t *face)
{
  hb_ot_layout_t *layout = &face->ot_layout;

  memset (layout, 0, sizeof (*layout));

  layout->gdef_blob = Sanitizer<GDEF>::sanitize (hb_face_get_table (face,
HB_OT_TAG_GDEF));
  layout->gdef = &Sanitizer<GDEF>::lock_instance (layout->gdef_blob);

  layout->gsub_blob = Sanitizer<GSUB>::sanitize (hb_face_get_table (face,
HB_OT_TAG_GSUB));
  layout->gsub = &Sanitizer<GSUB>::lock_instance (layout->gsub_blob);

  layout->gpos_blob = Sanitizer<GPOS>::sanitize (hb_face_get_table (face,
HB_OT_TAG_GPOS));
  layout->gpos = &Sanitizer<GPOS>::lock_instance (layout->gpos_blob);
}

Font table sanitizing really belongs in a separate font management
layer that Harfbuzz accesses.  In some cases sanitizing may be
expensive enough that it would make sense to cache the sanitized
results.  In other environments, a closed embedded environment with a
fixed set of fonts for example, it would make sense to sanitize the
fonts once at build time and not do the sanitizing at runtime.  The
sanitizing process may be part of a larger process that sanitizes all
tables before use.  Chrome for example completely reconstructs
downloaded font data, so doing it again as part of the shaper would be
redundant.

I also think the degree of caching will depend a lot on access
patterns and the size and complexity of the data in a given table. 
For small, simple tables, simply caching the table data may make the
most sense.  But for tables like the GSUB/GPOS tables with all their
nested offsets, some other form of data structure could easily make
better sense in a given environment.

There are some other design choices that don't make sense to me, the
code appears to be grouping feature tags with script and language
tags and making them all strings, even though all of these tags are by
definition four-byte tags in OpenType:

  typedef struct _hb_feature_t {
    const char   *name;
    const char   *value;
    unsigned int  start;
    unsigned int  end;
  } hb_feature_t;

There was discussion on the list about this [2].  From the design doc
description [1] this appears to be an attempt at future-proofing but
the library only attempts to implement shaping of OpenType, I don't
think it makes sense to design for things that don't exist.  This
leads to wasted cycles going between tags and strings.

Summary of my suggestions for Harfbuzz:

1. Remove resource mgmt APIs, replace with callbacks for accessing font data.  
   No locking/refcounting/mem swizzling internally.

2. Make the API and implemenentation C code if C-only code is a requirement,
   or normal C++ if it's not

3. Simplify font table sanitizing code and move it to a utility library

Regards,

John Daggett
Mozilla Japan

[1] Behdad's design overview of Harfbuzz:
    http://lists.freedesktop.org/archives/harfbuzz/2009-August/000359.html

[2] Feature/Script/Language tags discussion
    http://lists.freedesktop.org/archives/harfbuzz/2009-December/000495.html