[HarfBuzz] Mapping output glyphs back to input character

Behdad Esfahbod behdad at behdad.org
Wed Jul 25 16:45:57 PDT 2012


On 07/25/2012 07:17 PM, Khaled Hosny wrote:
> On Sun, Jul 22, 2012 at 11:37:23PM -0400, Behdad Esfahbod wrote:
>> Hi Khaled,
>>
>> On 07/21/2012 05:49 AM, Khaled Hosny wrote:
>>> How do I map output glyphs back to input characters? I assume I have to
>>> use clusters for that, but I can't make much sense of the cluster
>>> numbers I'm seeing, and I can't find any explanation for them.
>>
>> When you add text to a hb_buffer_t, you set a cluster number for each
>> character.  The hb_buffer_add_utf* functions implicitly use the index into
>> the input string as the cluster.  I.e., when using the UTF-8 version, UTF-8
>> byte indices are used.
>>
>> Note that hb-view/hb-shape by default use UTF-32 cluster numbers (i.e.,
>> character count instead of byte count).  You can change that using
>> --utf8-clusters.
> 
> I’m using UTF-16 (playing with porting LibreOffice to HarfBuzz), so how
> are surrogate pairs handled?

See the bottom of hb-buffer.cc.  The "cluster" values after shaping map back
to UTF-16 code unit indices in the original text.
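
As a rough illustration of what that means for surrogate pairs, here is a
small Python sketch (plain Python, not HarfBuzz code). It assumes that each
character's initial cluster value is the index of its first code unit in the
encoded input, which matches the description quoted in this thread but is
worth checking against hb-buffer.cc. The helper name initial_clusters is
hypothetical.

```python
# Illustration only: plain Python, not HarfBuzz code.
# Assumption: each character's initial cluster value is the index of its
# first code unit in the encoded input (bytes for UTF-8, 16-bit units for
# UTF-16, whole code points for UTF-32).

_CODECS = {
    "utf-8": ("utf-8", 1),       # code unit = 1 byte
    "utf-16": ("utf-16-le", 2),  # code unit = 2 bytes
    "utf-32": ("utf-32-le", 4),  # code unit = 4 bytes
}


def initial_clusters(text, encoding):
    """Return the assumed initial cluster value of every character in text."""
    codec, unit_size = _CODECS[encoding]
    clusters = []
    index = 0
    for ch in text:
        clusters.append(index)
        # Advance by the number of code units this character occupies.
        index += len(ch.encode(codec)) // unit_size
    return clusters


# 'a', U+1D11E (outside the BMP, so a surrogate pair in UTF-16), 'b'
sample = "a\U0001D11Eb"
print(initial_clusters(sample, "utf-8"))   # [0, 1, 5]
print(initial_clusters(sample, "utf-16"))  # [0, 1, 3]
print(initial_clusters(sample, "utf-32"))  # [0, 1, 2]
```

Under that assumption, a character encoded as a surrogate pair gets the index
of its first code unit, and the next character's cluster value skips ahead
by two.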

If you want to be more impactful, don't port LibreOffice, port iculayout!
It's probably 400 lines of code...

behdad

>> The shaping process implicitly segments the input text and output glyphs
>> into a series of clusters.  So you can think of, for LTR text, first cluster
>> followed by second cluster, followed by third cluster, etc, where each
>> cluster contains a number of characters and a number of glyphs.
>>
>> Now, after shaping, the hb_glyph_info_t::cluster member of each glyph is
>> simply set to the minimum of the original cluster values of all the
>> characters that belong to that glyph's cluster.
>>
>> For RTL it's similar, though in reverse direction.
>>
>> Quick example.  If you add text for "differ", then initially characters get
>> cluster values 0,1,2,3,4,5 respectively.  After shaping, if the 'ff' ligature
>> was formed, you will get five glyphs, with cluster values 0,1,2,4,5.  This
>> means that the two characters that originally had cluster values 2 and 3 are
>> represented by the sole glyph having the cluster value 2.
>>
>> Hope that helps.
> 
> Thanks Behdad, this was very helpful.
> 
> Regards,
>  Khaled
> 
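
The mapping described in the quoted explanation can be sketched in a few
lines of Python (again plain Python, not the HarfBuzz API). The function name
glyphs_to_char_ranges is hypothetical. The sketch assumes that the cluster
values come from a shaped buffer covering the whole input, that the smallest
cluster value marks the start of that input, and that text_length is counted
in the same units as the cluster values (bytes, UTF-16 code units, or
characters).

```python
# Illustration only: plain Python, not HarfBuzz code.
# Assumption: every cluster spans from its own value up to the next larger
# cluster value present in the output (or to the end of the text).

def glyphs_to_char_ranges(glyph_clusters, text_length):
    """Map each output glyph to the half-open input range [start, end)
    covered by the cluster it belongs to."""
    starts = sorted(set(glyph_clusters))
    # Each cluster ends where the next one starts; the last ends at the text end.
    ends = starts[1:] + [text_length]
    end_of = dict(zip(starts, ends))
    return [(c, end_of[c]) for c in glyph_clusters]


# "differ" with an 'ff' ligature: five glyphs, clusters 0,1,2,4,5.
print(glyphs_to_char_ranges([0, 1, 2, 4, 5], 6))
# [(0, 1), (1, 2), (2, 4), (4, 5), (5, 6)]

# The same clusters in reverse order, as they would appear for RTL output.
print(glyphs_to_char_ranges([5, 4, 2, 1, 0], 6))
# [(5, 6), (4, 5), (2, 4), (1, 2), (0, 1)]
```

Because the ranges are derived from the sorted set of cluster values rather
than from glyph order, the same function handles both LTR and RTL output, and
several glyphs sharing one cluster value all map to the same input range.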


