[HarfBuzz] Thoughts on harfbuzz API

Sun Dec 6 19:33:48 PST 2009

Dear Behdad,

> > I'm in the process of writing a python wrapper to help with testing harfbuzz before hopefully integrating Graphite. This gives me a good way to review the API :) and here are some thoughts.

I've got it to the point that I can print out a list of glyphs & positions, etc. from a text buffer and a font. So that's probably far enough to be useful.

> > 1. Features
> >
> > Currently a feature in hb-shape.h is defined as an association between two char * over a range. My understanding of all smart font technologies is that they work with longs. So I would suggest making the name and value entries unsigned longs rather than char *.
> 
> That may be true, but from a user point of view, I'd rather keep it as generic 
> as possible.  Jonathan and I discussed also providing an integer API, and that 
> most probably will happen at some point, but I want to keep the hb_shape() API 
> as is.

But that openness comes at a cost: the mapping between the input string and what is stored in the font has to be thoroughly described. Let me take each of the 3 aspects (features, lang, script) in turn.

1. Features

Inside a font, a feature identifier is a 16-bit number in AAT, a 32-bit tag in OT, or a 32-bit number/tag in Graphite. Both AAT and Graphite have an optional linkage from a feature identifier to a language string for its name. Now how might we interpret the feature identifier string (name)? It could be an ASCII number that is converted to either a 16-bit or a 32-bit number, or it could be a 4-char tag that gets converted to a long, or it could be a UI-level name that has to be interpreted via a specified (or defaulted) language id and the name table. Ultimately, I would suggest that it has to map down to a long (which for AAT can be further truncated to a 16-bit id). Given that the choice of what the input char * may be is up to the calling application, I would suggest that the mapping is best done there, with just the long passed in. That reduces the complexity of harfbuzz.

Likewise, the value of a feature has to come down to a number. In OT it can be more than just 0 or 1, as some newer features take a numeric parameter. So I would suggest that, for the ease of harfbuzz, it is passed as a long.

There is nothing to stop us later adding helper functions that can fill in the entries of a feature struct from char *s. But I would suggest we start simply.
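
A rough sketch of what doing the mapping in the application could look like, in Python since that is what the wrapper is written in. The function names are made up for illustration and are not part of harfbuzz; the policy for telling a number from a tag is just one possible choice.

```python
def tag_to_int(tag: str) -> int:
    """Pack a tag of 1 to 4 ASCII characters into a 32-bit big-endian integer.

    Short tags are padded on the right with spaces, as OT tags are.
    """
    if not tag.isascii() or not 1 <= len(tag) <= 4:
        raise ValueError(f"not a valid tag: {tag!r}")
    raw = tag.encode("ascii").ljust(4, b" ")
    return int.from_bytes(raw, "big")


def feature_id_from_string(name: str) -> int:
    """Hypothetical application-side policy for turning a string into a feature id.

    A string of ASCII digits is read as a decimal number, anything else as a tag.
    Note the ambiguity: an all-digit string such as "1234" could also be meant
    as a tag, so the application has to decide which reading wins.
    """
    if name.isascii() and name.isdigit():
        value = int(name)
        if value > 0xFFFFFFFF:
            raise ValueError(f"feature id does not fit in 32 bits: {name}")
        return value
    return tag_to_int(name)


def aat_feature_id(feature_id: int) -> int:
    """Truncate a 32-bit id to the 16 bits AAT uses."""
    return feature_id & 0xFFFF


print(hex(feature_id_from_string("liga")))
print(feature_id_from_string("17"))
```

UI-level names would need the name table and a language id on top of this, which is exactly the part that seems better left to the caller.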

2. Langs

I was about to write a similar argument for langs, but then realised you are right. The lang identifier should be a full string. My main concern here is that the list of languages supported by harfbuzz be open. I think your current solution works well: allowing an initialised cache and caching the rest.
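
To illustrate the general idea of a pre-seeded but open language list, here is a minimal Python sketch. It is not how harfbuzz implements it; the class name, the integer handles and the case folding are all assumptions made for the example.

```python
class LanguageCache:
    """Hypothetical open-ended cache mapping language strings to small handles."""

    def __init__(self, preloaded):
        self._ids = {}
        # Seed the cache with the languages known up front.
        for name in preloaded:
            self.intern(name)

    def intern(self, name: str) -> int:
        """Return the handle for a language string, adding it if it is new."""
        key = name.lower()  # language tags are compared case-insensitively here
        return self._ids.setdefault(key, len(self._ids))


cache = LanguageCache(["en", "fa"])
print(cache.intern("en"))         # already seeded
print(cache.intern("x-unknown"))  # not seeded, gets cached on first use
```

The point is only that an unknown language is never rejected; it just gets a new entry.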

3. Scripts

As for languages, I think we have an opportunity here to make harfbuzz resilient against Unicode version changes. If the script is passed as a string instead of as a member of an enum, then there is no enum that has to be updated every Unicode release with all the new scripts that have been added. It's a simple matter to dictate that the string is interpreted as an ISO 15924 code. This will make harfbuzz more stable, especially when it ends up in embedded devices without an annual upgrade cycle.

This is not to say that a segmenter can't work with a closed set of scripts (although the more that can be done to open such things up, the better). Also the mapping from script to shaper in OT would become a search (binary perhaps) rather than a simple array lookup. But I think the gained forward compatibility would be worth the cost.
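
A small Python sketch of the lookup described above, under the assumption that the script arrives as a four-letter ISO 15924 code. The table contents and shaper names are placeholders, not harfbuzz's actual shaper list.

```python
import bisect


def script_tag(code: str) -> int:
    """Pack a four-letter ISO 15924 code (e.g. "Arab") into a 32-bit integer."""
    if len(code) != 4 or not code.isascii() or not code.isalpha():
        raise ValueError(f"not an ISO 15924 code: {code!r}")
    # ISO 15924 codes are written with an initial capital, so normalise to that.
    return int.from_bytes(code.title().encode("ascii"), "big")


# Hypothetical script-to-shaper table, kept sorted by tag so it can be searched.
_SHAPERS = sorted([
    (script_tag("Arab"), "arabic"),
    (script_tag("Deva"), "indic"),
    (script_tag("Hebr"), "hebrew"),
    (script_tag("Thai"), "thai"),
])
_KEYS = [tag for tag, _ in _SHAPERS]


def shaper_for_script(code: str, default: str = "default") -> str:
    """Binary-search the table; scripts it does not know fall back to a default."""
    key = script_tag(code)
    i = bisect.bisect_left(_KEYS, key)
    if i < len(_KEYS) and _KEYS[i] == key:
        return _SHAPERS[i][1]
    return default


print(shaper_for_script("Arab"))
print(shaper_for_script("Zzzz"))
```

A script added in a later Unicode release simply misses the table and gets the default shaper, without any enum needing to change.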

Yours,
Martin