[HarfBuzz] HarfBuzz API design

Wed Aug 19 14:04:24 PDT 2009

On 08/19/2009 04:15 PM, Adam Twardoch wrote:
> I think it would be useful to have a helper function akin to Microsoft's
> ScriptItemizeOpenType()* that breaks a Unicode string into individually
> shapeable items (runs) and provides an array of feature tags for each
> shapeable item for OpenType processing.

Thanks Adam.  So far the focus has been to unify the shaping logic (most 
important for Indic).  Itemization, while pretty well defined, is something 
everyone does slightly differently.  It requires:

   - Applying Unicode Bidi Algorithm,

   - Script tagging heuristic for Script=Common characters

   - Language tagging heuristic

   - Font assignment

Except for the first item, which is well-defined by Unicode, the other steps 
are less well-defined, and different use cases require slightly different 
solutions.  For example, web browsers have very strict font assignment rules 
that follow the CSS spec; other applications, less so.  So it would be hard to 
justify a unified itemizer, at least initially.  But yes, that's one of 
the logical next steps.
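
To make the second step concrete, here is a minimal sketch of one possible 
script-tagging heuristic for Script=Common characters.  It is not HarfBuzz 
code: the script_t enum and both function names are hypothetical, and the 
rule used (a Common character inherits the script of the nearest preceding 
non-Common character, and leading Common characters take the first real 
script that follows) is only one of several reasonable choices.

```c
#include <stddef.h>

/* Hypothetical script tags, for illustration only.  A real itemizer
 * would use the Unicode Script property values instead. */
typedef enum {
    SCRIPT_COMMON,
    SCRIPT_LATIN,
    SCRIPT_ARABIC,
    SCRIPT_DEVANAGARI
} script_t;

/* Resolve SCRIPT_COMMON entries in place.
 * Assumed rule: a Common character inherits the script of the nearest
 * preceding non-Common character; leading Common characters take the
 * first non-Common script that follows them.  If every entry is Common,
 * the array is left unchanged. */
static void resolve_common_scripts(script_t *scripts, size_t len)
{
    size_t i;
    script_t last = SCRIPT_COMMON;

    /* Seed with the first real script so leading Commons resolve to it. */
    for (i = 0; i < len; i++) {
        if (scripts[i] != SCRIPT_COMMON) {
            last = scripts[i];
            break;
        }
    }

    for (i = 0; i < len; i++) {
        if (scripts[i] == SCRIPT_COMMON)
            scripts[i] = last;      /* inherit from context */
        else
            last = scripts[i];      /* remember the current real script */
    }
}

/* Count the runs of identical script after resolution; each run would
 * be one candidate item to shape (before bidi, language and font splits). */
static size_t count_script_runs(const script_t *scripts, size_t len)
{
    size_t i, runs = 0;

    for (i = 0; i < len; i++) {
        if (i == 0 || scripts[i] != scripts[i - 1])
            runs++;
    }
    return runs;
}
```

A caller would fill an array with one script_t per character, call 
resolve_common_scripts() on it, and then split the text wherever the script 
changes.  A browser or a text editor may well want a different rule here 
(paired brackets, for instance), which is exactly why this is hard to unify.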

behdad

> * http://msdn.microsoft.com/en-us/library/dd368557%28VS.85%29.aspx
>
> Adam
>
> Behdad Esfahbod wrote:
>> On 08/19/2009 02:57 AM, Martin Hosken wrote:
>>> Dear Behdad,
>>>
>>> I feel that this is the core of the API since it specifies what inputs and outputs harfbuzz works with (particularly outputs).
>>
>> Hi Martin,
>>
>> Yes, hb_shape() and the hb_glyph_info_t are essentially the core of the API.
>>
>>
>>>> typedef struct _hb_glyph_info_t {
>>>>      hb_codepoint_t codepoint;
>>>>      hb_mask_t      mask;
>>>>      uint32_t       cluster;
>>>>      uint16_t       component;
>>>>      uint16_t       lig_id;
>>>>      uint32_t       internal;
>>>> } hb_glyph_info_t;
>>> I may have misinterpreted, but mask, lig_id and probably component feel OT-specific, in that a consumer of the output is unlikely to ever need them.
>>
>> Yes and no.  Mask is used to mark which user features should be applied to
>> which glyphs, and I think at least AAT can/will use that too.  For lig_id and
>> component, they are not inherently OT-specific.  They are implementation
>> details of how HarfBuzz implements the OT spec.  We may decide to hide them
>> too, and just have another internal member.  Individual shapers can use the
>> internal members as they wish then.  That's actually a good idea.  Unless I
>> find a use for the client having access to those values, they had better be
>> hidden.  I'll make that change now.
>>
>> I'm thinking about adding some other fields here though (without changing
>> the size).  Things like justification points, etc.
>>
>>
>>> The disadvantage I see with having a single buffer that changes its contents from chars to glyphs is that then you lose the association map between underlying chars and glyphs. I suppose it can be recreated using the component information, but it's going to be problematic when it comes to cursor hit testing.
>>
>> The decision is only relevant inside the hb_shape() call.  The user has the
>> original text still.  Please see the last part of my reply to Carl Worth.
>>
>>
>>>> For script and language, it's a bit more delicate.  I'm also convinced that
>>>> they belong to the buffer.  With script it's fine, but with language it
>>>> introduces a small implementation hassle: that I would have to deal with
>>>> copying/interning language tags, something I was trying to avoid.  The other
>>>> options are:
>>>>
>>>>      - Extra parameters to hb_shape().  I'd rather not do this.  Keeping details
>>>> like this out of the main API and adding setters where appropriate makes the
>>>> API cleaner and more extensible.
>>>>
>>>>      - Use the feature dict for them too.  I'm strictly against this one.  The
>>>> feature dict is already too high-level for my taste.
>>> Why do you say the feature dict is too high level? It seems just the right place, to me. Or it could be stored in the buffer, since it is buffer specific.
>>
>> It's just not as efficient and easy to use as I like.  But it's just fine for
>> user features, yes.
>>
>>
>>> One question: is a buffer representing a single run for which the language doesn't change or is it potentially multiple runs that are yet to be segmented?
>>
>> The way I'd recommend using it is for one run.  The API already limits it to
>> one font anyway.  Doesn't mean we can't add API to do multiple runs in the
>> future though.
>>
>> behdad
>>
>>> Yours,
>>> Martin
>>>