[HarfBuzz] harfbuzz review
John Daggett
jdaggett at mozilla.com
Thu Apr 15 23:19:10 PDT 2010
Behdad Esfahbod wrote:
> >> You seem to drastically under-estimate the headaches of
> >> lifecycle-management in a library. Mozilla also disables cairo's
> >> mutex use. So I don't think Mozilla is the perfect use case of
> >> those features. When you are dealing with FT_Face and hb_face_t
> >> objects passed around between cairo, pango, poppler, Python
> >> wrappers, etc, the resource management API is absolutely
> >> necessary IMO.
> >
> > I'm not underestimating it, I'm simply arguing that the code for
> > doing that resource management does not belong in Harfbuzz.
>
> "resource management" != object lifecycle management.
I'm not arguing that resource management and/or object lifecycle
management aren't needed, I'm simply saying that code for doing these
functions belongs outside the core shaping code and is better-handled
via an API that apps can tune to fit their needs.
> Allowing the user code to cache the sanitized tables is planned,
> just not provided right now.
Ok, great.
> You ignored part of my reply that I said I have plans to add API to
> decouple sanitizing from layout. In my design, when that API is
> added, you get the functionality you are asking for. Remains your
> request that the sanitize code is put in a separate module. I don't
> see any benefit in doing that.
Hmm, if by decoupling you mean that it's possible to provide the
shaping routines with pre-sanitized font tables or some other
sanitized intermediate form, that's great. Then apps can use a
harfbuzz utility routine to sanitize font tables when necessary.
> > App usage patterns vary. In the case of GSUB/GPOS tables, the
> > sanitized table structure itself may not be the ideal structure to
> > cache because untangling all the interrelated intra-table and
> > inter-table references requires still more processing before use.
>
> What kind of more processing? Are you suggesting that I should
> parse GSUB/GPOS tables into an internal structure and serialize into
> a different format? No thank you.
No, I'm not suggesting a different serialization format, I'm saying
that instead of caching the full GSUB/GPOS tables it might be better
to cache some data structure that's been distilled for a given purpose.
For a lot of documents, hb_shape will be called repeatedly with the
same font/script/lang combination so it would make sense to cache the
lookups for that font/script/lang combination rather than
reconstructing it on each call to hb_shape. Basically the lookup
arrays produced by calling setup_lookups in _hb_ot_substitute_complex
and _hb_ot_position_complex.
> >> Going from tags to strings and vice versa is at most a byteswap.
> >
> > Below is the code in hb-ot-tag.c that converts between OpenType
> > language tags in string form to 4-byte tag form. It's a little
> > more than a byteswap, note the bsearch call.
>
> Wrong. The expensive part is converting from RFC-3066 / BCP-49
> language tags to OpenType tags. If I drop that code, Firefox,
> Pango, and every other user of harfbuzz has to do that non-trivial
> conversion. Where do you think '<html lang="fa">' is converted to
> 'FAR '? I don't see why that code doesn't belong into harfbuzz when
> the HB API is designed for humans who don't necessarily know about
> OpenType.
I'm sure you're right, converting from RFC-3066 / BCP-49 language tags
is more expensive in comparison. But by using strings to identify
language, script and feature tags you're forcing the conversion from
these strings to OpenType tags to occur *every* time a buffer is
passed to hb_shape. For a given document using a single language that
conversion really only needs to happen once for all buffers passed to
hb_shape. Ditto for script and feature tags.
I think it's fine to have a utility API that does the conversion from
a string-based language tag to 4-byte tags, equally usable by humans
or four-legged creatures.
> >> It's definitely more complex than the code it's replacing (which was
> >> plain C). However, it's very carefully designed, easier to audit,
> >> and I'm much more confident in it than in the code it replaces.
> >
> > Complex code for dealing with complex data is one thing but the
> > Harfbuzz code makes even simple things complicated in some cases. Take
> > as an example some of the struct's defined in hb-open-type-private.hh,
> > there's a whole set of *structs* defined for basic data types. There
> > are far, far simpler ways of dealing with endian data than this. What
> > advantage does this complexity add? Does it somehow simplify the code?
>
> I'm all ears to hear about those "far, far simpler ways of dealing with endian
> data". What this code does is to obviate the need to call a bytecode
> conversion function/macro/... in thousands of places, each of which could be
> omitted by mistake. So, review the ten lines and you're confident that there
> is no endian bugs in that 5kloc.
I agree, using structs for endian swaps is a good thing but you're
also "sanitizing" int's there! Unfolding all the macros there it
looks like it's checking to see that the int field is within a given
bound; couldn't this be inferred with a bounds check on the containing
struct?
A typical endian-swapping struct might look something like this (this
is an aligned version but an unaligned version isn't that different):
#define SWAP32(a) /* swaps if needed depending on machine env */
struct ULONG {
ULONG(unsigned u) : v(SWAP32(u)) { }
operator unsigned() const { return SWAP32(v); }
unsigned v;
};
Used in a TrueType table directory:
struct TableDirectory {
ULONG tag;
ULONG checkSum;
ULONG offset;
ULONG length;
}
> Other than endianness, this chunk of code also defines int types
> that have no alignment requirements.
But table data in TrueType fonts is typically 4-byte aligned. Is
there a situation where long/short data fields within a TrueType table
are odd-byte aligned? Or you just want the code to handle the rare
case where a poorly made font has non-4-byte aligned tables?
Cheers,
John
More information about the HarfBuzz
mailing list