[HarfBuzz] Mapping output glyphs back to input character
Khaled Hosny
khaledhosny at eglug.org
Wed Jul 25 16:17:00 PDT 2012
On Sun, Jul 22, 2012 at 11:37:23PM -0400, Behdad Esfahbod wrote:
> Hi Khaled,
>
> On 07/21/2012 05:49 AM, Khaled Hosny wrote:
> > How do I map output glyphs back to input characters? I assume I have
> > to use clusters for that, but I can't make much sense of the cluster
> > numbers I'm seeing and can't find any explanation for them.
>
> When you add text to a hb_buffer_t, you set a cluster number for each
> character. The functions hb_buffer_add_utf* implicitly use the index into the
> input string for the cluster. I.e., when using the UTF-8 version, UTF-8
> indices are used.
>
> Note that hb-view/hb-shape by default use UTF-32 cluster numbers (i.e.,
> character count instead of byte count). You can change that using
> --utf8-clusters.
I’m using UTF-16 (playing with porting LibreOffice to HarfBuzz), so how
are surrogate pairs handled?
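To make the question concrete, here is a small sketch of what I would
expect if the "index into the input string" rule carries over to UTF-16
unchanged: each character gets the index of its first 16-bit code unit, so
a surrogate pair uses up two indices but yields one cluster value. This is
an assumption, not HarfBuzz code; utf16_cluster_indices is a hypothetical
helper written only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, NOT part of HarfBuzz. It assumes that a UTF-16
 * cluster value is the code-unit index at which a character starts.
 * Writes one cluster value per character into `clusters` (which must have
 * room for `len` entries) and returns the number of characters found.
 */
static size_t
utf16_cluster_indices(const uint16_t *text, size_t len, unsigned int *clusters)
{
    size_t i = 0; /* index in UTF-16 code units */
    size_t n = 0; /* number of characters seen so far */

    while (i < len) {
        clusters[n++] = (unsigned int) i;

        /* A high surrogate followed by a low surrogate is one character
         * spanning two code units; everything else takes one unit. */
        if (text[i] >= 0xD800 && text[i] <= 0xDBFF && i + 1 < len &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
            i += 2;
        else
            i += 1;
    }
    return n;
}
```

Under that assumption, the four code units 0x0061 0xD835 0xDC00 0x0062
would be three characters with cluster values 0, 1 and 3.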
> The shaping process implicitly segments the input text + output glyphs in a
> series of clusters. So you can think of, for LTR text, first cluster followed
> by second cluster, followed by third cluster, etc, where each cluster contains
> a number of characters and a number of glyphs.
>
> Now, the hb_glyph_info_t::cluster member after shaping is simply the
> minimum value of that member over all the characters that belong to the cluster.
>
> For RTL it's similar, though in reverse direction.
>
> Quick example. If you add text for "differ", then initially characters get
> cluster values 0,1,2,3,4,5 respectively. After shaping, if the 'ff' ligature
> was formed, you will get five glyphs, with cluster values 0,1,2,4,5. This
> means that the two characters that originally had cluster values 2 and 3 are
> represented by the sole glyph having the cluster value 2.
>
> Hope that helps.
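The "differ" example above can be turned into a small lookup. The sketch
below works on a plain array of cluster values (in real code they would be
read from hb_glyph_info_t::cluster after shaping) and only covers the LTR
case, where cluster values never decrease from one glyph to the next. The
names char_range_t and glyph_char_range are made up for illustration and
are not part of HarfBuzz.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [start, end) of input indices covered by a cluster. */
typedef struct {
    unsigned int start;
    unsigned int end;
} char_range_t;

/*
 * Hypothetical helper, NOT part of HarfBuzz. Assumes LTR output, i.e.
 * `clusters` is non-decreasing. The range for a glyph starts at its own
 * cluster value and ends at the next larger cluster value, or at
 * `text_len` if the glyph belongs to the last cluster.
 */
static char_range_t
glyph_char_range(const unsigned int *clusters, size_t num_glyphs,
                 size_t glyph_index, unsigned int text_len)
{
    char_range_t range;
    size_t j;

    assert(glyph_index < num_glyphs);

    range.start = clusters[glyph_index];
    range.end = text_len;

    /* Skip glyphs that share this cluster value (one character can
     * produce several glyphs) and stop at the first larger value. */
    for (j = glyph_index + 1; j < num_glyphs; j++) {
        if (clusters[j] > range.start) {
            range.end = clusters[j];
            break;
        }
    }
    return range;
}
```

With the cluster values 0,1,2,4,5 and a text length of 6, the glyph at
index 2 maps to the range [2, 4), which is the two characters of the 'ff'
ligature, and the last glyph maps to [5, 6).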
Thanks Behdad, this was very helpful.
Regards,
Khaled