[HarfBuzz] harfbuzz review

Wed Apr 14 23:20:42 PDT 2010

Based on feedback from Behdad, I made a few additional comments, included below.

Original post here:
https://bugzilla.mozilla.org/show_bug.cgi?id=449292#c55

> It's definitely more complex than the code it's replacing (which was
> plain C). However, it's very carefully designed, easier to audit,
> and I'm much more confident in it than in the code it replaces.

Complex code for dealing with complex data is one thing but the
Harfbuzz code makes even simple things complicated in some cases. Take
as an example some of the struct's defined in hb-open-type-private.hh,
there's a whole set of *structs* defined for basic data types.  There
are far, far simpler ways of dealing with endian data than this.  What
advantage does this complexity add?  Does it somehow simplify the code?

============== from hb-open-type-private.hh ==============

#define _DEFINE_INT_TYPE1_UNALIGNED(NAME, TYPE, BIG_ENDIAN, BYTES) \
  struct NAME \
  { \
    inline NAME& set (TYPE i) { (TYPE&) v = BIG_ENDIAN (i); return *this; } \
    inline operator TYPE(void) const { return BIG_ENDIAN ((TYPE&) v); } \
    inline bool operator== (NAME o) const { return (TYPE&) v == (TYPE&) o.v; }
\
    inline bool sanitize (SANITIZE_ARG_DEF) { \
      TRACE_SANITIZE (); \
      return SANITIZE_SELF (); \
    } \
    private: unsigned char v[BYTES]; \
  }; \
  ASSERT_SIZE (NAME, BYTES)

#define DEFINE_INT_TYPE1(NAME, TYPE, BIG_ENDIAN, BYTES) \
  struct NAME \
  { \
    inline NAME& set (TYPE i) { BIG_ENDIAN##_put_unaligned(v, i); return *this;
} \
    inline operator TYPE(void) const { return BIG_ENDIAN##_get_unaligned (v); }
\
    inline bool operator== (NAME o) const { return BIG_ENDIAN##_cmp_unaligned
(v, o.v); } \
    inline bool sanitize (SANITIZE_ARG_DEF) { \
      TRACE_SANITIZE (); \
      return SANITIZE_SELF (); \
    } \
    private: unsigned char v[BYTES]; \
  }; \
  ASSERT_SIZE (NAME, BYTES)
#define DEFINE_INT_TYPE0(NAME, type, b)    DEFINE_INT_TYPE1 (NAME, type##_t,
hb_be_##type, b)
#define DEFINE_INT_TYPE(NAME, u, w)    DEFINE_INT_TYPE0 (NAME, u##int##w, (w /
8))

DEFINE_INT_TYPE (USHORT, u, 16);    /* 16-bit unsigned integer. */
DEFINE_INT_TYPE (SHORT,      , 16);    /* 16-bit signed integer. */
DEFINE_INT_TYPE (ULONG,     u, 32);    /* 32-bit unsigned integer. */
DEFINE_INT_TYPE (LONG,      , 32);    /* 32-bit signed integer. */

/* Array of four uint8s (length = 32 bits) used to identify a script, language
 * system, feature, or baseline */
struct Tag : ULONG
{
  inline Tag (const Tag &o) { *(ULONG*)this = (ULONG&) o; }
  inline Tag (uint32_t i) { (*(ULONG*)this).set (i); }
  inline Tag (const char *c) { *(ULONG*)this = *(ULONG*)c; }
  inline bool operator== (const char *c) const { return *(ULONG*)this ==
*(ULONG*)c; }
  /* What the char* converters return is NOT nul-terminated.  Print using
"%.4s" */
  inline operator const char* (void) const { return CONST_CHARP(this); }
  inline operator char* (void) { return CHARP(this); }

  inline bool sanitize (SANITIZE_ARG_DEF) {
    TRACE_SANITIZE ();
    /* Note: Only accept ASCII-visible tags (mind DEL)
     * This is one of the few times (only time?) we check
     * for data integrity, as opposed o just boundary checks
     */
    return SANITIZE_SELF () && (((uint32_t) *this) & 0x80808080) == 0;
  }
};
ASSERT_SIZE (Tag, 4);
#define _NULL_TAG_INIT  {' ', ' ', ' ', ' '}
DEFINE_NULL_DATA (Tag, 4, _NULL_TAG_INIT);
#undef _NULL_TAG_INIT

==========================================================

> > One part of this complexity is the mishmash of language
> > conventions used, the API is C but the code is limited subset of
> > C++.  On the
> 
> What's so complex about it?  The API is C.  Anything else is
> implementation details.  The previous HarfBuzz also used C++ for
> some of the implementation. The way the hb-layout code uses C++ *is*
> very complex though, I admit, but not complicated.
> 
> > Despite this, some of the code in Harfbuzz is clearly using C++
> > but in a way so as to avoid pulling in libstd++.  There's a big
> > mix of macro-ized code, template use and classes-as-structs coding.
> 
> Suggestions for improvements are welcome.  The code works with gcc
> and MSVC, and have been tested to build fine on ARM (and has
> compile-time safeguards), so I don't see what's wrong with
> complexity if it works great.  That said, there *is* an outstanding
> mystery bug with the Apple toolchain.  We have not been able to nail
> that one down yet.

The point is not that it compiles correctly, it's whether the code is
simple and easy to maintain.  If all text in Firefox is going to pass
through this code we need for it to be sufficiently simple and easy to
debug by an engineer unfamiliar with the code.  Using complicated
nested macro-ized structs makes that job much, much harder.

> > There are also parts of the API that involve resource management
> > that I think really don't belong there.  For example, font tables
> > are handled as arbitrary binary objects, hb_blob's:
> 
> At the outset they are binary objects.  What's wrong with that?

Because font tables have a much more specific usage pattern than
arbitrary binary data.  Why obfuscate that fact?

> > Font table sanitizing really belongs in a separate font management
> > layer that Harfbuzz accesses.
> 
> I've considered that before.  The problem however is: there are
> numerous fonts out there that fail the literal word of the spec in
> one way or another.  Using an external sanitizer will result in
> rejecting the layout tables in that fonts. The aim of the harfbuzz
> sanitizer is to clean that up just enough such that we can use the
> correct bits of the tables.  It must closely match the code using
> the tables and that's why it's tied closely to the layout code
> (instead of being put in a separate module logically).

Sorry, can't an external sanitizer do the clean up you suggest?  

I'm suggesting that the sanitizing happen outside the shaper code, so
that apps are free to manage the sanitized data based on their needs. 
For example, a lot of platform fonts may already by sufficiently
sanitized that this isn't needed, the code could easily cache that
fact and skip the sanitizing process the next time.  Or for platforms
with a locked-down set of fonts that sanitizing could take place as
part of the build process, no need to do it at runtime.  The current
code doesn't permit this, it ties the sanitization to the create of a
layout object.

> The Chrome sanitizer doesn't check the OpenType Layout tables. 
> Given all the offsets in the GSUB/GPOS/GDEF tables, reconstructing
> them is a tedious task, and would be pointless when one can do with
> the much simpler sanitizer in harfbuzz.

Right, Chrome dumps the GSUB/GPOS/GDEF tables.  Reconstructing or
sanitizing are the same thing, just with different degrees of
strictness.  If it's tedious for these tables, all the more reason to
allow the sanitized versions to be easily cached outside of Harfbuzz
code.  That's sort of what the hb_blob memory mode stuff is trying to
allow.

> In the future, I'll add API and cmdline tools to sanitize fonts.  On
> Linux for example we'll cache the sanitizing result in the
> fontconfig cache and skip sanitizing for sane fonts.  Other
> environments will be able to do similarly.  It's planned, just not
> there yet.

The fontconfig cache is a good example of a more appropriate place for
resource management.

> > I also think the degree of caching will depend a lot on access
> > patterns and the size and complexity of the data in a given table.
> > For small, simple tables, simply caching the table data may make
> > the most sense.  But for tables like the GSUB/GPOS tables with all
> > their nested offsets, some other form of data structure could
> > easily make better sense in a given environment.
> 
> Not sure what kind of caching you are referring to.  Harfbuzz itself
> doesn't do any caching.

That's my point, Harfbuzz doesn't allow the caching of sanitized font
tables.  It would be better to allow apps to do this resource
management rather than Harfbuzz.  App usage patterns vary.  In the
case of GSUB/GPOS tables, the sanitized table structure itself may not
be the ideal structure to cache because untangling all the
interrelated intra-table and inter-table references requires still
more processing before use.

> Going from tags to strings and vice versa is at most a byteswap.

Below is the code in hb-ot-tag.c that converts between OpenType
language tags in string form to 4-byte tag form.  It's a little more
than a byteswap, note the bsearch call.

============== from hb-ot-tag.c ==============

hb_tag_t
hb_ot_tag_from_language (hb_language_t language)
{
  const char *lang_str;
  LangTag *lang_tag;

  if (language == NULL)
    return HB_OT_TAG_DEFAULT_LANGUAGE;

  lang_str = hb_language_to_string (language);

  if (0 == strcmp (lang_str, "ot:")) {
    char tag[4];
    int i;
    lang_str += 3;
    i = 0;
    while (i < 4 && lang_str[i]) {
      tag[i] = lang_str[i];
    }
    while (i < 4)
      tag[i] = ' ';
    return HB_TAG_STR (tag);
  }

  /* find a language matching in the first component */
  lang_tag = bsearch (lang_str, ot_languages,
              ARRAY_LENGTH (ot_languages), sizeof (LangTag),
              lang_compare_first_component);

  /* we now need to find the best language matching */
  if (lang_tag)
  {
    hb_bool_t found = FALSE;

    /* go to the final one matching in the first component */
    while (lang_tag + 1 < ot_languages + ARRAY_LENGTH (ot_languages) &&
       lang_compare_first_component (lang_str, lang_tag + 1) == 0)
      lang_tag++;

    /* go back, find which one matches completely */
    while (lang_tag >= ot_languages &&
       lang_compare_first_component (lang_str, lang_tag) == 0)
    {
      if (lang_matches (lang_str, lang_tag->language)) {
    found = TRUE;
    break;
      }

      lang_tag--;
    }

    if (!found)
      lang_tag = NULL;
  }

  if (lang_tag)
    return lang_tag->tag;

  return HB_OT_TAG_DEFAULT_LANGUAGE;
}

hb_language_t
hb_ot_tag_to_language (hb_tag_t tag)
{
  unsigned int i;
  unsigned char buf[8] = "ot:";

  for (i = 0; i < ARRAY_LENGTH (ot_languages); i++)
    if (ot_languages[i].tag == tag)
      return hb_language_from_string (ot_languages[i].language);

  buf[3] = tag >> 24;
  buf[4] = (tag >> 16) & 0xFF;
  buf[5] = (tag >> 8) & 0xFF;
  buf[6] = tag & 0xFF;
  return hb_language_from_string ((char *) buf);
}

==============================================

> The API though is designed to allow finer-level control over features.
>  For example, to allow disabling GPOS kerning and enabling old-style
> TrueType kerning. Moreover, HarfBuzz aims to be an SFNT text shaper,
> not necessarily limited to OpenType.  There's work on adding Graphite
> support already, and AAT can be added in the future.  Moreover, there
> is need to have a string-based API.  I mentioned in the thread that a
> separate uint32_t-based API may be added in the future.

For a library whose main purpose is to do OpenType layout this seems
like complete overdesign.  All OpenType language, script and feature
tags are 4-byte tags that use only ASCII charaters in the 0x20 to 0x7F
range. So there's plenty of space for Harfbuzz-only tags to support
things like kern table kerning (e.g. HB_OT_TAG_TTKERN = 0x80000001). 
Graphite uses similar tags.  If strings are required for AAT, add an
API when support for that is added.

This applies equally to hb_language_t, hb_script_t, and hb_feature_t.

> > Summary of my suggestions for Harfbuzz:
> > 
> > 1. Remove resource mgmt APIs, replace with callbacks for accessing font data.  
> >    No locking/refcounting/mem swizzling internally.
> 
> You seem to drastically under-estimate the headaches of lifecycle-management in
> a library.  Mozilla also disables cairo's mutex use.  So I don't think Mozilla
> is the perfect use case of those features.  When you are dealing with FT_Face
> and hb_face_t objects passed around between cairo, pango, poppler, Python
> wrappers, etc, the resource management API is absolutely necessary IMO.

I'm not underestimating it, I'm simply arguing that the code for doing
that resource management does not belong in Harfbuzz.

> > 2. Make the API and implemenentation C code if C-only code is a requirement,
> >    or normal C++ if it's not
> 
> There is no C-only requirement.  The requirement is C-only API.  And the C++
> code is standard.  Not sureewhat you mean by "normal C++".  Everyone uses their
> own subset of C++.  HarfBuzz is no exception.

There are C design patterns and C++ design patterns.  You're mixing them here
in ways that make the code hard to maintain.  Either use a simple set of C
structs and associated functions or create a set of straight-forward C++
classes
with a C-API.

> > 3. Simplify font table sanitizing code and move it to a utility library
> 
> Again, the sanitize code is an absolutely necessary part of the design.  Can't
> be moved out, and is not of much use outside harfbuzz as it doesn't try to be
> sanitize the tables to the word of the spec (such a sanitizer exists in
> FreeType, and is useless IMO other than to font designers).

Just because it's necessary doesn't mean it needs to be tied directly to the
code that uses it in ways that make caching the sanitized data difficult.