[HarfBuzz] Features, masks and glyph attribution.

Adam Twardoch (List) list.adam at twardoch.com
Mon Jan 28 12:07:07 PST 2013


This is an interesting question indeed.

Let me give a simple example of why this can be challenging. It's not
100% realistic, but it could be replaced with a slightly more complex
yet realistic example.

Let's consider the string "fistfight" (
\u0066\u0069\uFB06\uFB01\u0067\u0068\u0074 ). I'm deliberately using
\uFB06 and \uFB01 here. The font's "cmap" table converts the string into
a glyph run:
/f/i/uniFB06/uniFB01/g/h/t

Let's consider a font that has the following features, both of which
are applied to the glyph run:

feature ccmp {
  sub uniFB06 by s t;
  sub uniFB01 by f i;
} ccmp;

feature liga {
  sub t f by t_f;
} liga;

After ccmp has been applied, the glyph run is /f/i/s/t/f/i/g/h/t.
After liga has been applied as well, we end up with the glyph run:
/f/i/s/t_f/i/g/h/t

As we can see, the two characters \uFB06\uFB01 are now represented as
three glyphs /s/t_f/i . Which glyphs get attributed to which characters?
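
To make the ambiguity concrete, here is a small Python sketch that replays
the two features above on toy data and records, for every glyph, the indices
of the characters that contributed to it. It has nothing to do with how
HarfBuzz or Uniscribe work internally; the tables and the helper names
(CMAP, CCMP, LIGA, map_characters, apply_multiple, apply_ligature) are made
up for the illustration.

```python
# Illustrative only: toy tables mirroring the ccmp and liga rules above.
# Each glyph is a (name, sources) pair; sources holds the indices of the
# characters in the original string that contributed to the glyph.

TEXT = "\u0066\u0069\uFB06\uFB01\u0067\u0068\u0074"

# Hypothetical stand-in for the font's cmap; any other character maps
# to a glyph named after itself.
CMAP = {"\uFB06": "uniFB06", "\uFB01": "uniFB01"}
CCMP = {"uniFB06": ["s", "t"], "uniFB01": ["f", "i"]}  # one-to-many
LIGA = {("t", "f"): "t_f"}                             # two-to-one


def map_characters(text):
    """Turn characters into glyphs, each tagged with its character index."""
    return [(CMAP.get(ch, ch), (index,)) for index, ch in enumerate(text)]


def apply_multiple(run, table):
    """One-to-many substitution: every output glyph keeps the input's sources."""
    out = []
    for name, sources in run:
        for piece in table.get(name, [name]):
            out.append((piece, sources))
    return out


def apply_ligature(run, table):
    """Two-to-one substitution: the ligature gets the union of both sources."""
    out = []
    i = 0
    while i < len(run):
        pair = tuple(name for name, _ in run[i:i + 2])
        if pair in table:
            merged = tuple(sorted(set(run[i][1] + run[i + 1][1])))
            out.append((table[pair], merged))
            i += 2
        else:
            out.append(run[i])
            i += 1
    return out


run = apply_ligature(apply_multiple(map_characters(TEXT), CCMP), LIGA)
for name, sources in run:
    print(name, sources)
```

In this sketch /t_f ends up listing both character 2 (\uFB06) and character
3 (\uFB01), while /s lists only character 2 and /i only character 3, so
there is no way to assign each glyph to exactly one character without
throwing information away.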

When multiple lookups are applied, a font can split and combine glyphs
quite freely. On top of that, things like glyph reordering happen
within the Indic shapers.

I must admit that this is an aspect of text processing I don't know much
about. I don't know how Uniscribe solves this, for example. But I agree
that a solution to it would be useful.

(We should remember that for ligatures, the "GDEF" table provides caret
positions, which can/should be somehow worked into that.)
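
As a rough idea of how caret positions could be worked in: a ligature with
n components normally has n-1 caret values in GDEF, so a horizontal offset
inside the ligature can be mapped to a component, and from there to a
character. The Python sketch below assumes one source character per
component, which is a simplification, and the caret value 310 and the
function names are invented for the illustration.

```python
from bisect import bisect_right


def component_at(x, carets):
    """Index of the ligature component that contains horizontal offset x.

    carets is the sorted list of caret positions (n-1 values for a
    ligature with n components), in the same units as x.
    """
    return bisect_right(carets, x)


def character_at(x, carets, sources):
    """Character index for offset x, assuming one character per component."""
    return sources[component_at(x, carets)]


# Hypothetical data for /t_f: one caret at 310 units, built from
# characters 2 and 3 of the original string.
T_F_CARETS = [310]
T_F_SOURCES = (2, 3)

print(character_at(100, T_F_CARETS, T_F_SOURCES))  # left part, the "t"
print(character_at(400, T_F_CARETS, T_F_SOURCES))  # right part, the "f"
```

This only covers the ligature itself; the /s and /i glyphs that come from
the same two characters would still need to be attributed separately.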

Best,
Adam

On 13-01-28 20:42, Alexander Sabourenkov wrote:
> Hello.
>
> I'm stuck in understanding more general aspects of HarfBuzz and
> shaping; reading and tracing the code went into diminishing returns
> mode.
>
> The task I'm struggling with is this: after calling hb_shape(), map each
> resulting glyph to the Unicode code point in the initial string that
> caused the glyph in question to be emitted.
>
> I'm sorry if that doesn't parse, let me explain. I have a UCS-2
> string, without surrogates, where each character is associated with
> some data structure. Let's say it's just an integer value, an index of
> that character in the string.
>
> hb_shape() converts that to a sequence of glyphs. How do I know which
> glyph corresponds to which character [index]?
>
> I don't think even the order of the glyphs is the same as that of
> characters for RTL scripts. I suspect that one character may result in
> an arbitrary number of glyphs.
>
> Reading the code led me to a hypothesis that user-defined hb_feature_t
> values can somehow end up in hb_glyph_info_t::mask (no obvious way to
> extract though).
> However, further work ended in that I'm not sure that's possible at all.
>
> Can someone please enlighten me on:
>
>  - is it possible at all in stock HarfBuzz? how?
>  - if not, what would be a reasonable way to hack that in, API-wise?
>  - what would be prospects of such a patch being merged?
>


-- 

May success attend your efforts,
-- Adam Twardoch
(Remove "list." from e-mail address to contact me directly.)



