[HarfBuzz] Fwd: harfbuzz work

Wed Aug 5 14:40:20 PDT 2009

Hi Jonathan and others,

Finally I have some more code to show.  I actually merged my branch into
pango's master branch, so harfbuzz-ng lives there for now.

More comments inline.

On 07/15/2009 01:30 PM, Jonathan Kew wrote:
> This was originally written to Behdad, but copying the HB mailing list
> as it may be of interest to others. Feedback welcome. :)
>
> JK
>
> Begin forwarded message:
>
>> From: Jonathan Kew <jonathan at jfkew.plus.com>
>> Date: 24 June 2009 19:18:33 BST
>> To: Behdad Esfahbod <behdad at behdad.org>
>> Subject: harfbuzz work
>>
>> Hi Behdad,
>>
>> FYI, I'm attaching some experimentation I've been doing with HarfBuzz.
>> This is based on your harfbuzz-ng from *before* the most recent commit
>> ("XX") to that branch, as it appeared to be in a somewhat broken (or
>> should I say partially-updated) state there.

Yeah, I've finished that patch now and added a whole bunch more.

>> The zip file contains new stuff I've been writing, working towards a
>> HarfBuzz-based module we could use in Gecko, without relying on
>> anything else in Pango. There are also a few modifications to your
>> code in pango/opentype, attached as a separate diff file.

Committed the patch.  Thanks.

>> What I've done here - some of which you may want to take into HarfBuzz
>> itself, unless you already have better solutions:
>>
>> * Alternate layout constructor taking pointers to the OpenType tables;
>> I'm using this on OS X at the moment as it's the most convenient way
>> to provide the font data. We won't always have an actual file
>> available for the mmap() approach, though of course that's ideal when
>> we can use it.

I finally added the high-level public hb_font_t and hb_face_t API.  There is a
constructor that takes a get_table callback...  I'll write a more detailed
mail about the API later.
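
A minimal sketch of what a get_table-style constructor could look like, for
readers following along.  Every name below (sketch_face_t,
sketch_get_table_func_t, and so on) is hypothetical and invented for
illustration; the actual hb_face_t API may differ.  The point is only that the
face asks the client for a table by tag, so the client can serve tables from
memory when no file is available for mmap().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not the real hb_face_t API. */

#define SKETCH_TAG(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* A borrowed view of one font table; data is NULL if the table is absent. */
typedef struct {
    const uint8_t *data;
    size_t length;
} sketch_blob_t;

/* Client-supplied callback: return the table with the given tag. */
typedef sketch_blob_t (*sketch_get_table_func_t)(uint32_t tag, void *user_data);

typedef struct {
    sketch_get_table_func_t get_table;
    void *user_data;
} sketch_face_t;

/* Constructor that takes a callback instead of a file or memory map. */
static sketch_face_t
sketch_face_create_for_tables(sketch_get_table_func_t get_table, void *user_data)
{
    sketch_face_t face;
    assert(get_table != NULL);
    face.get_table = get_table;
    face.user_data = user_data;
    return face;
}

static sketch_blob_t
sketch_face_get_table(const sketch_face_t *face, uint32_t tag)
{
    return face->get_table(tag, face->user_data);
}

/* Example client: tables that are already sitting in memory. */
typedef struct {
    uint32_t tag;
    const uint8_t *data;
    size_t length;
} sketch_table_entry_t;

typedef struct {
    const sketch_table_entry_t *entries;
    size_t count;
} sketch_table_dir_t;

static sketch_blob_t
sketch_lookup_in_memory(uint32_t tag, void *user_data)
{
    const sketch_table_dir_t *dir = user_data;
    sketch_blob_t blob = { NULL, 0 };
    size_t i;
    for (i = 0; i < dir->count; i++) {
        if (dir->entries[i].tag == tag) {
            blob.data = dir->entries[i].data;
            blob.length = dir->entries[i].length;
            break;
        }
    }
    return blob;
}
```

Because the face only ever sees the callback, loading an additional table
later needs no change to the constructor.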

>> * In hb-buffer, made hb_buffer_ensure() public as it could be useful
>> for client code to preallocate space, if it knows how much text is
>> coming; also gave hb_buffer_new() a size parameter so that the caller
>> can ask for an initial allocation size.

Agreed.

>> * More importantly, I think hb_buffer_ensure() had a bug in the case
>> where out_string == in_string; it was realloc'ing in_string before
>> checking whether the pointers were the same, which means the in_string
>> pointer is likely to have been changed and the wrong branch will be
>> chosen. I think this is fixed correctly in the attached patch.

Thanks.  There was another bug there that I also fixed.
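
For the archive, a sketch of the aliasing problem described above.  This is
not the actual hb-buffer code or the attached patch; the struct and function
names are made up, and the growth policy is an assumption.  It only shows the
ordering that matters: note whether the two pointers are the same before
realloc() can move in_string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buffer; not the real hb_buffer_t layout. */
typedef struct {
    uint32_t *in_string;
    uint32_t *out_string;  /* may point at the same storage as in_string */
    size_t allocated;      /* capacity of each array, in items */
} sketch_buffer_t;

/* Grow the buffer to hold at least `size` items.  Returns 1 on success. */
static int
sketch_buffer_ensure(sketch_buffer_t *b, size_t size)
{
    size_t new_allocated;
    uint32_t *new_in;
    int shared;

    assert(b != NULL);
    if (size <= b->allocated)
        return 1;

    new_allocated = b->allocated ? b->allocated : 8;
    while (new_allocated < size)
        new_allocated *= 2;  /* assumed growth policy; no overflow check */

    /* Record the aliasing BEFORE realloc.  Comparing afterwards is the bug:
     * in_string may already have moved, so the comparison would wrongly say
     * "separate" and the stale out_string would be realloc'ed again. */
    shared = (b->out_string == b->in_string);

    new_in = realloc(b->in_string, new_allocated * sizeof *new_in);
    if (new_in == NULL)
        return 0;
    b->in_string = new_in;

    if (shared) {
        b->out_string = new_in;
    } else {
        uint32_t *new_out = realloc(b->out_string, new_allocated * sizeof *new_out);
        if (new_out == NULL)
            return 0;  /* in_string grew, but `allocated` still reports the old size */
        b->out_string = new_out;
    }

    b->allocated = new_allocated;
    return 1;
}
```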

>> * Provided a small HB-friendly cmap-reader (currently handles formats
>> 4 and 12 only).

I thought a lot about whether we want to deal with cmap directly.  There are 
multiple reasons not to:

   - fontconfig, for example, can handle non-Unicode cmaps by calling iconv.

   - For characters not supported by the font, we need to ask the higher level
what to do.  Pango uses special codes that are later used to draw hexboxes.

For the above two reasons, I think it would be better to use a callback for 
cmap conversion.

>> * A script-run itemizer based on ICU's, but adapted to support text in
>> any of UTF-8, 16, or 32 (not actually tested with them all yet, though).

Again, I'm not sure I want to keep the itemizer in harfbuzz.  I think I'll
figure it out as we move forward.

As for UTF-8/16/32 my current plan is to add API that imports those into a 
hb_buffer_t, and everything else works on the buffer.
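
A rough sketch of the "import into the buffer" idea for the UTF-8 case.  The
function names are hypothetical and the output is a plain array rather than a
real hb_buffer_t.  The error policy (each malformed byte becomes U+FFFD) is an
assumption made for this example, not a decision stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[0..len-1], len > 0.  Returns the number of
 * bytes consumed.  Malformed input yields U+FFFD and consumes one byte. */
static size_t
sketch_utf8_next(const uint8_t *s, size_t len, uint32_t *cp)
{
    uint8_t c = s[0];
    size_t need, i;
    uint32_t v, min;

    assert(len > 0);
    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { need = 1; v = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { need = 2; v = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { need = 3; v = c & 0x07; min = 0x10000; }
    else { *cp = 0xFFFD; return 1; }  /* stray continuation or invalid lead */

    if (need >= len) { *cp = 0xFFFD; return 1; }  /* truncated sequence */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        *cp = 0xFFFD;
        return 1;
    }
    *cp = v;
    return need + 1;
}

/* Append the decoded text to out[0..cap-1]; returns the number of code
 * points written.  Stops early if the output array is full. */
static size_t
sketch_buffer_add_utf8(uint32_t *out, size_t cap, const uint8_t *text, size_t len)
{
    size_t pos = 0, count = 0;
    while (pos < len && count < cap) {
        uint32_t cp;
        pos += sketch_utf8_next(text + pos, len - pos, &cp);
        out[count++] = cp;
    }
    return count;
}
```

UTF-16 and UTF-32 importers would have the same shape, and everything after
the import would only ever see code points in the buffer.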

>> * Code to look up the Unicode character properties we're likely to
>> need; currently script, bidi direction, and arabic joining type. This
>> can be retrieved from the ICU property APIs, if the client is using
>> ICU anyway, or there's a local implementation supporting just the
>> properties needed in the layout process. Actually, as we don't do bidi
>> within HarfBuzz, I'm not sure we need that property; on the other
>> hand, we may need character types (combining marks, etc) for cluster
>> handling - I haven't looked into that yet.

Again, these can all be taken care of by what I'm currently calling
hb_unicode_callbacks_t.  For testing, I'd rather use glib's instead of having
scripts to extract them in yet another place.

>> * Proposed shaping-function API (see hb-shaper.h) and two shaper
>> implementations (generic and arabic/syriac/n'ko). These support
>> user-specified features in addition to the defaults and
>> script-specific shaping features. Oh, they also handle mirroring using
>> the OMPL table, and apply ltra/rtla etc according to direction.

Thanks.  I'll get to them soon.  Regarding OMPL, I'm of two minds.  I
personally prefer to use the latest Unicode mirroring properties instead.  The
idea of fixing on OMPL was stupid IMO.

>> In the shaper API that I'm using right now, the approach is to
>> initially fill the buffer with *character* codes, and the shaper
>> function takes a pointer to a cmap table in addition to the layout
>> record. I did this because shaping needs access to the Unicode values,
>> not just the glyphs. I suppose we could specify that the cmap table
>> can be NULL, in which case the buffer is assumed to contain glyph IDs
>> already, but this will make most complex-script shaping impossible.
>> (Actually, it's a problem even for the generic shaper, as it needs the
>> Unicode character codes for mirroring.)

That's kinda what I have in mind, yes.  I'm actually thinking of hb_shape()
calling the following four functions:

   hb_substitute_default()  -> does cmap conversion
   hb_substitute_complex()  -> does GSUB substitution

   hb_position_default()    -> does default glyph-metrics positioning
   hb_position_complex()    -> does GPOS positioning

Better naming welcome.
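
To make the four-stage idea concrete, here is a sketch of how such a pipeline
might be wired together.  Only the four stage names come from the list above;
the sketch_ prefix, the item and font structs, the callback signatures, and
the toy callbacks are all assumptions.  The two "complex" stages are left as
placeholders where GSUB and GPOS would run.  Each item keeps its Unicode value
next to the glyph, since shaping needs both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; not the real harfbuzz structures. */
typedef struct {
    uint32_t unicode;    /* kept around: mirroring and shaping need it */
    uint32_t glyph;      /* filled in by the default substitution stage */
    int32_t  x_advance;  /* filled in by the default positioning stage */
} sketch_item_t;

typedef struct {
    uint32_t (*get_glyph)(uint32_t unicode, void *user_data);  /* cmap lookup */
    int32_t  (*get_advance)(uint32_t glyph, void *user_data);  /* glyph metrics */
    void *user_data;
} sketch_font_t;

/* Stage 1: cmap conversion, delegated to the client callback. */
static void
sketch_substitute_default(const sketch_font_t *f, sketch_item_t *items, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        items[i].glyph = f->get_glyph(items[i].unicode, f->user_data);
}

/* Stage 2: GSUB substitution would go here (placeholder). */
static void
sketch_substitute_complex(const sketch_font_t *f, sketch_item_t *items, size_t n)
{
    (void)f; (void)items; (void)n;
}

/* Stage 3: default positioning from glyph metrics. */
static void
sketch_position_default(const sketch_font_t *f, sketch_item_t *items, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        items[i].x_advance = f->get_advance(items[i].glyph, f->user_data);
}

/* Stage 4: GPOS positioning would go here (placeholder). */
static void
sketch_position_complex(const sketch_font_t *f, sketch_item_t *items, size_t n)
{
    (void)f; (void)items; (void)n;
}

static void
sketch_shape(const sketch_font_t *f, sketch_item_t *items, size_t n)
{
    assert(f != NULL && f->get_glyph != NULL && f->get_advance != NULL);
    sketch_substitute_default(f, items, n);
    sketch_substitute_complex(f, items, n);
    sketch_position_default(f, items, n);
    sketch_position_complex(f, items, n);
}

/* Toy client callbacks: 'A'..'Z' map to glyphs 1..26, anything else to 0. */
static uint32_t
toy_get_glyph(uint32_t unicode, void *user_data)
{
    (void)user_data;
    return (unicode >= 'A' && unicode <= 'Z') ? unicode - 'A' + 1 : 0;
}

static int32_t
toy_get_advance(uint32_t glyph, void *user_data)
{
    (void)user_data;
    return glyph ? 600 : 1000;  /* wider advance for the missing glyph */
}
```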

>> Assuming we use this model of making the shaper be responsible for
>> mapping Unicode to glyphs, should the cmap table be incorporated into
>> the layout record just like GDEF/GSUB/GPOS? I did it separately for
>> now just to minimize disruption to your opentype files, but there's
>> not much reason to keep it separate IMO.

The new API allows us to load any table we want in the future with no API 
change, so that's really an implementation detail.

>> One outstanding issue is passing parameters to features like 'aalt'
>> (alternate substitution lookups). I see you have a "placeholder" for a
>> callback function in AlternateSubstFormat1::apply, but this doesn't
>> look quite sufficient AFAICT. In order to return the proper index, the
>> function would need to know which feature is currently being
>> processed, which is information that is not available at this level of
>> applying the lookup. (Note that it would be possible for a run of text
>> to have several Alternate features applied, with different indexes
>> used for each of them.)

I'm finally convinced that we don't want a callback approach.  I think I have
something in mind that may work.  The idea is that the mask for a feature can
have more than one bit set, and we use those bits of the glyph property as a
selector.  Make sense?
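
One possible reading of the multi-bit mask idea, as a sketch only.  The bit
layout, the function names, and the convention that value 0 means "feature
off" while value N picks the Nth alternate are all assumptions made for this
example; nothing here is settled in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Position of the lowest set bit of a non-zero mask. */
static unsigned
sketch_mask_shift(uint32_t mask)
{
    unsigned shift = 0;
    assert(mask != 0);
    while ((mask & 1u) == 0) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Read the value stored in the feature's bits of a glyph's mask. */
static uint32_t
sketch_feature_value(uint32_t glyph_mask, uint32_t feature_mask)
{
    return (glyph_mask & feature_mask) >> sketch_mask_shift(feature_mask);
}

/* Store a value in the feature's bits, leaving the other bits untouched.
 * Values too large for the field are truncated to fit. */
static uint32_t
sketch_set_feature_value(uint32_t glyph_mask, uint32_t feature_mask, uint32_t value)
{
    uint32_t field = (value << sketch_mask_shift(feature_mask)) & feature_mask;
    return (glyph_mask & ~feature_mask) | field;
}

/* Alternate substitution: value 0 leaves the glyph alone, value N picks the
 * Nth alternate (1-based); out-of-range values also leave it alone. */
static uint32_t
sketch_apply_alternate(uint32_t glyph, const uint32_t *alternates, size_t count,
                       uint32_t glyph_mask, uint32_t feature_mask)
{
    uint32_t value = sketch_feature_value(glyph_mask, feature_mask);
    if (value == 0 || value > count)
        return glyph;
    return alternates[value - 1];
}
```

With a four-bit field, a feature could select among up to 15 alternates per
glyph, and two Alternate features on the same run would simply use different
bit ranges of the mask.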

>> I'm wondering whether it would be feasible to use the "mask" parameter
>> to hb_ot_layout_{substitute,position}_lookup to help here. This is
>> used to selectively switch lookups off for certain glyphs in the
>> buffer, in order to implement things like Arabic shaping, but if we
>> could assume that the shapers should never need more than 24 bits for
>> this purpose (will a shaper ever need individual control of 24
>> distinct features or sets of features?), then we could also use the
>> low byte of the mask to pass a "feature argument" through to the
>> lookups. Currently, the mask is not passed all the way down to the
>> individual subtable apply() functions, so this would need to be done,
>> but I don't think that would be hard, and it would allow a specific
>> alternate index associated with a feature to be passed on to that
>> feature's lookup(s) and used to choose the right alternate. What do
>> you think - should I give this a try and see how it works in practice?

I'll give it a try as I port more code into the hb_* namespace.

Thanks,
behdad

>> Regards,
>>
>> Jonathan
>>
>>
>>
>