[HarfBuzz] harfbuzz review

Thu Apr 15 11:28:25 PDT 2010

On 04/15/2010 02:20 AM, John Daggett wrote:

>> It's definitely more complex than the code it's replacing (which was
>> plain C). However, it's very carefully designed, easier to audit,
>> and I'm much more confident in it than in the code it replaces.
> 
> Complex code for dealing with complex data is one thing but the
> Harfbuzz code makes even simple things complicated in some cases. Take
> as an example some of the struct's defined in hb-open-type-private.hh,
> there's a whole set of *structs* defined for basic data types.  There
> are far, far simpler ways of dealing with endian data than this.  What
> advantage does this complexity add?  Does it somehow simplify the code?

I'm all ears to hear about those "far, far simpler ways of dealing with endian
data".  What this code does is to obviate the need to call a bytecode
conversion function/macro/... in thousands of places, each of which could be
omitted by mistake.  So, review the ten lines and you're confident that there
is no endian bugs in that 5kloc.

Other than endianness, this chunk of code also defines int types that have no
alignment requirements.

The use of a macro is to avoid copy/paste'ing the those ten lines 4 times (for
USHORT, SHORT, ULONG, LONG).

Of the chunk you pasted, one is an optimized version of the other one, and has
a TODO item before it that you didn't quote:

/* TODO On machines that allow unaligned access, use this version. */
#define _DEFINE_INT_TYPE1_UNALIGNED(NAME, TYPE, BIG_ENDIAN, BYTES) \
   ...
#define DEFINE_INT_TYPE1(NAME, TYPE, BIG_ENDIAN, BYTES) \
   ...

DEFINE_INT_TYPE (USHORT, u, 16);        /* 16-bit unsigned integer. */
DEFINE_INT_TYPE (SHORT,   , 16);        /* 16-bit signed integer. */
DEFINE_INT_TYPE (ULONG,  u, 32);        /* 32-bit unsigned integer. */
DEFINE_INT_TYPE (LONG,    , 32);        /* 32-bit signed integer. */

And then you also copied the struct definition for Tag.  Not sure why.

Using C++ classes to do byteswap and other conversions is a common technique
AFAIK.  That's hardly the most complex part of the code.

> The point is not that it compiles correctly, it's whether the code is
> simple and easy to maintain.  If all text in Firefox is going to pass
> through this code we need for it to be sufficiently simple and easy to
> debug by an engineer unfamiliar with the code.

Fair enough.

> Using complicated nested macro-ized structs makes that job much, much harder.

Does using simple templates instead make you happier?

>>> There are also parts of the API that involve resource management
>>> that I think really don't belong there.  For example, font tables
>>> are handled as arbitrary binary objects, hb_blob's:
>>
>> At the outset they are binary objects.  What's wrong with that?
> 
> Because font tables have a much more specific usage pattern than
> arbitrary binary data.  Why obfuscate that fact?

You talk so vaguely about "resource management" that I'm not sure what to
reply.  What's your proposed API without hb_blob_t?  Tell me and I show you
the problem with it.  hb_blob_t is a simple way to refcount the table data and
have someone clean it up when it's not needed anymore.  How do you suggest I
should do that instead?

>>> Font table sanitizing really belongs in a separate font management
>>> layer that Harfbuzz accesses.
>>
>> I've considered that before.  The problem however is: there are
>> numerous fonts out there that fail the literal word of the spec in
>> one way or another.  Using an external sanitizer will result in
>> rejecting the layout tables in that fonts. The aim of the harfbuzz
>> sanitizer is to clean that up just enough such that we can use the
>> correct bits of the tables.  It must closely match the code using
>> the tables and that's why it's tied closely to the layout code
>> (instead of being put in a separate module logically).
> 
> Sorry, can't an external sanitizer do the clean up you suggest?  
> 
> I'm suggesting that the sanitizing happen outside the shaper code, so
> that apps are free to manage the sanitized data based on their needs. 
> For example, a lot of platform fonts may already by sufficiently
> sanitized that this isn't needed, the code could easily cache that
> fact and skip the sanitizing process the next time.  Or for platforms
> with a locked-down set of fonts that sanitizing could take place as
> part of the build process, no need to do it at runtime.  The current
> code doesn't permit this, it ties the sanitization to the create of a
> layout object.

You ignored part of my reply that I said I have plans to add API to decouple
sanitizing from layout.  In my design, when that API is added, you get the
functionality you are asking for.  Remains your request that the sanitize code
is put in a separate module.  I don't see any benefit in doing that.

>> The Chrome sanitizer doesn't check the OpenType Layout tables. 
>> Given all the offsets in the GSUB/GPOS/GDEF tables, reconstructing
>> them is a tedious task, and would be pointless when one can do with
>> the much simpler sanitizer in harfbuzz.
> 
> Right, Chrome dumps the GSUB/GPOS/GDEF tables.  Reconstructing or
> sanitizing are the same thing, just with different degrees of
> strictness.  If it's tedious for these tables, all the more reason to
> allow the sanitized versions to be easily cached outside of Harfbuzz
> code.  That's sort of what the hb_blob memory mode stuff is trying to
> allow.

Allowing the user code to cache the sanitized tables is planned, just not
provided right now.

>> In the future, I'll add API and cmdline tools to sanitize fonts.  On
>> Linux for example we'll cache the sanitizing result in the
>> fontconfig cache and skip sanitizing for sane fonts.  Other
>> environments will be able to do similarly.  It's planned, just not
>> there yet.
> 
> The fontconfig cache is a good example of a more appropriate place for
> resource management.
> 
>>> I also think the degree of caching will depend a lot on access
>>> patterns and the size and complexity of the data in a given table.
>>> For small, simple tables, simply caching the table data may make
>>> the most sense.  But for tables like the GSUB/GPOS tables with all
>>> their nested offsets, some other form of data structure could
>>> easily make better sense in a given environment.
>>
>> Not sure what kind of caching you are referring to.  Harfbuzz itself
>> doesn't do any caching.
> 
> That's my point, Harfbuzz doesn't allow the caching of sanitized font
> tables.

It will.  Soon.  Don't confuse caching sanitized font tables with lifecycle
management though.  hb_blob_t is mostly about lifecycle and memory management.

>  It would be better to allow apps to do this resource
> management rather than Harfbuzz.

It *is* up to the apps.  The apps can tell HarfBuzz whether it can modify the
font data in place, modify making a copy, or not modify at all.  Just need to
provide the public sanitize API, then apps can sanitize and cache the tables
for later use.

>  App usage patterns vary.  In the
> case of GSUB/GPOS tables, the sanitized table structure itself may not
> be the ideal structure to cache because untangling all the
> interrelated intra-table and inter-table references requires still
> more processing before use.

What kind of more processing?  Are you suggesting that I should parse
GSUB/GPOS tables into an internal structure and serialize into a different
format?  No thank you.

>> Going from tags to strings and vice versa is at most a byteswap.
> 
> Below is the code in hb-ot-tag.c that converts between OpenType
> language tags in string form to 4-byte tag form.  It's a little more
> than a byteswap, note the bsearch call.

Wrong.  The expensive part is converting from RFC-3066 / BCP-49 language tags
to OpenType tags.  If I drop that code, Firefox, Pango, and every other user
of harfbuzz has to do that non-trivial conversion.  Where do you think
'<html lang="fa">' is converted to 'FAR '?  I don't see why that code doesn't
belong into harfbuzz when the HB API is designed for humans who don't
necessarily know about OpenType.

>> The API though is designed to allow finer-level control over features.
>>  For example, to allow disabling GPOS kerning and enabling old-style
>> TrueType kerning. Moreover, HarfBuzz aims to be an SFNT text shaper,
>> not necessarily limited to OpenType.  There's work on adding Graphite
>> support already, and AAT can be added in the future.  Moreover, there
>> is need to have a string-based API.  I mentioned in the thread that a
>> separate uint32_t-based API may be added in the future.
> 
> For a library whose main purpose is to do OpenType layout this seems
> like complete overdesign.  All OpenType language, script and feature
> tags are 4-byte tags that use only ASCII charaters in the 0x20 to 0x7F
> range. So there's plenty of space for Harfbuzz-only tags to support
> things like kern table kerning (e.g. HB_OT_TAG_TTKERN = 0x80000001). 
> Graphite uses similar tags.  If strings are required for AAT, add an
> API when support for that is added.
> 
> This applies equally to hb_language_t, hb_script_t, and hb_feature_t.

I already addressed hb_language_t.  hb_script_t also is Unicode scripts, which
is not necessarily the same as OpenType scripts (OpenType lags behind).  I
also addressed your hb_feature_t concern already: I said I'm considering
adding API that takes hb_tag_t instead of strings.  Not sure why you are still
not happy.

>>> Summary of my suggestions for Harfbuzz:
>>>
>>> 1. Remove resource mgmt APIs, replace with callbacks for accessing font data.  
>>>    No locking/refcounting/mem swizzling internally.
>>
>> You seem to drastically under-estimate the headaches of lifecycle-management in
>> a library.  Mozilla also disables cairo's mutex use.  So I don't think Mozilla
>> is the perfect use case of those features.  When you are dealing with FT_Face
>> and hb_face_t objects passed around between cairo, pango, poppler, Python
>> wrappers, etc, the resource management API is absolutely necessary IMO.
> 
> I'm not underestimating it, I'm simply arguing that the code for doing
> that resource management does not belong in Harfbuzz.

"resource management" != object lifecycle management.

>>> 2. Make the API and implemenentation C code if C-only code is a requirement,
>>>    or normal C++ if it's not
>>
>> There is no C-only requirement.  The requirement is C-only API.  And the C++
>> code is standard.  Not sureewhat you mean by "normal C++".  Everyone uses their
>> own subset of C++.  HarfBuzz is no exception.
> 
> There are C design patterns and C++ design patterns.  You're mixing them here
> in ways that make the code hard to maintain.  Either use a simple set of C
> structs and associated functions or create a set of straight-forward C++
> classes with a C-API.

I don't think using certain language features to simplify the code / make it
easier to prove correct is an unacceptable use of the language.  I think we
just need to agree to disagree on this part.

>>> 3. Simplify font table sanitizing code and move it to a utility library
>>
>> Again, the sanitize code is an absolutely necessary part of the design.  Can't
>> be moved out, and is not of much use outside harfbuzz as it doesn't try to be
>> sanitize the tables to the word of the spec (such a sanitizer exists in
>> FreeType, and is useless IMO other than to font designers).
> 
> Just because it's necessary doesn't mean it needs to be tied directly to the
> code that uses it in ways that make caching the sanitized data difficult.

I think I addressed that three times now.

Cheers,
behdad