[HarfBuzz] Mapping output glyphs back to input character

Khaled Hosny khaledhosny at eglug.org
Wed Jul 25 16:17:00 PDT 2012


On Sun, Jul 22, 2012 at 11:37:23PM -0400, Behdad Esfahbod wrote:
> Hi Khaled,
> 
> On 07/21/2012 05:49 AM, Khaled Hosny wrote:
> > How do I map output glyphs back to input characters? I assume I have to
> > use clusters for that, but I can't make much sense of the cluster
> > numbers I'm seeing and don't seem to find any explanation for them.
> 
> When you add text to a hb_buffer_t, you set a cluster number for each
> character.  The functions hb_buffer_add_utf* implicitly use the index into the
> input string for the cluster.  I.e., when using the UTF-8 version, UTF-8
> indices (byte offsets) are used.
> 
> Note that hb-view/hb-shape by default use UTF-32 cluster numbers (i.e.,
> character-count instead of byte-count).  You can change that using
> --utf8-clusters.

I’m using UTF-16 (playing with porting LibreOffice to HarfBuzz), so how
are surrogate pairs handled?
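
To make the question concrete, here is a minimal sketch of what "index into
the input string" would mean for each encoding. It assumes that the cluster of
a character is the index of its first code unit, so a surrogate pair would get
the index of its high surrogate; that is my assumption, not something confirmed
in this thread. initial_clusters() and the two length helpers are hypothetical
names, not HarfBuzz API.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 code units (bytes) needed to encode code point cp. */
static unsigned utf8_len(uint32_t cp)
{
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Number of UTF-16 code units needed to encode code point cp
 * (2 means a surrogate pair). */
static unsigned utf16_len(uint32_t cp)
{
  return cp < 0x10000 ? 1 : 2;
}

/* Hypothetical helper: for each of the n code points in text, store the
 * index of its first code unit in a UTF-8, UTF-16 and UTF-32 encoding of
 * the same text.  Assumption: this is the initial cluster value. */
static void initial_clusters(const uint32_t *text, size_t n,
                             uint32_t *c8, uint32_t *c16, uint32_t *c32)
{
  uint32_t off8 = 0, off16 = 0;
  for (size_t i = 0; i < n; i++) {
    c8[i]  = off8;            /* byte offset */
    c16[i] = off16;           /* UTF-16 code unit offset */
    c32[i] = (uint32_t) i;    /* character count */
    off8  += utf8_len(text[i]);
    off16 += utf16_len(text[i]);
  }
}
```

For the four characters "a", U+0628, U+1D11E, "b" this gives 0,1,3,7 for
UTF-8, 0,1,2,4 for UTF-16 and 0,1,2,3 for UTF-32. Under the stated assumption,
U+1D11E (a surrogate pair in UTF-16) has cluster 2 and no character ever has
cluster 3.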

> The shaping process implicitly segments the input text + output glyphs into a
> series of clusters.  So you can think of, for LTR text, first cluster followed
> by second cluster, followed by third cluster, etc, where each cluster contains
> a number of characters and a number of glyphs.
> 
> Now, the hb_glyph_info_t::cluster member after shaping is simply the minimum
> of the initial cluster values of all the characters that belong to the cluster.
> 
> For RTL it's similar, though in reverse direction.
> 
> Quick example.  If you add text for "differ", then initially characters get
> cluster values 0,1,2,3,4,5 respectively.  After shaping, if the 'ff' ligature
> was formed, you will get five glyphs, with cluster values 0,1,2,4,5.  This
> means that the two characters that originally had cluster values 2 and 3 are
> represented by the sole glyph having the cluster value 2.
> 
> Hope that helps.
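
Going by the description above, mapping a glyph back to its characters could
be sketched as follows: a glyph covers the input indices from its own cluster
value up to, but not including, the next larger cluster value in the output
(or the end of the text). This is only a sketch; char_range_t and
glyph_char_range() are hypothetical names, not HarfBuzz API, and text_len has
to be in the same units as the cluster values (UTF-16 code units in my case).

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open range [start, end) of input indices covered by one glyph. */
typedef struct {
  uint32_t start;
  uint32_t end;
} char_range_t;

/* Hypothetical helper: clusters[] holds the cluster value of each output
 * glyph, as read from hb_glyph_info_t::cluster after shaping.  The range
 * ends at the smallest cluster value larger than the glyph's own, so the
 * scan does not depend on the glyphs being in LTR or RTL order. */
static char_range_t glyph_char_range(const uint32_t *clusters, size_t n_glyphs,
                                     size_t i, uint32_t text_len)
{
  char_range_t r = { clusters[i], text_len };
  for (size_t j = 0; j < n_glyphs; j++)
    if (clusters[j] > r.start && clusters[j] < r.end)
      r.end = clusters[j];
  return r;
}
```

With the "differ" example, the clusters 0,1,2,4,5 and a text length of 6 give
the range [2,4) for the third glyph, i.e. the two characters of the 'ff'
ligature, and one-character ranges for the other glyphs. Glyphs that share a
cluster value get the same range, so they can only be mapped back as a group.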

Thanks Behdad, this was very helpful.

Regards,
 Khaled


