[HarfBuzz] Trouble with clusters and accented latin characters

Sun Oct 14 10:38:24 PDT 2012

On 12-10-14 12:31 PM, Lóránt Pintér wrote:
> Hi,
> 
> I'm trying to shape the word "tér" with HarfBuzz, and this is what I get back:
> 
> hb_buffer_get_glyph_infos() after calling hb_buffer_add_utf8():
> 
> Char #0: { codepoint: 116, mask: 1, cluster: 0, var1: 0, var2: 0 }
> Char #1: { codepoint: 233, mask: 1, cluster: 1, var1: 0, var2: 0 }
> Char #2: { codepoint: 114, mask: 1, cluster: 3, var1: 0, var2: 0 }
> 
> …and after calling hb_shape():
> 
> Glyph #0: { codepoint: 86, mask: 1, cluster: 0, var1: 2, var2: 5 }
> Glyph #1: { codepoint: 156, mask: 1, cluster: 1, var1: 2, var2: 5 }
> Glyph #2: { codepoint: 84, mask: 1, cluster: 3, var1: 2, var2: 5 }
> 
> I believed up to now that each cluster corresponded to a character in the
> original string. Why is the letter "é" turned into two clusters here?

When you use hb_buffer_add_utf8(), cluster values are set to byte offsets into
the original UTF-8 string, not character indices.  The precomposed "é" (U+00E9)
takes two bytes in UTF-8, so "r" starts at byte 3 and the cluster values read
0, 1, 3.  "é" is still a single cluster; there is simply no cluster 2.  If you
prefer plain character indices instead, loop over the buffer and reset the
cluster values before calling hb_shape().  For example, this is from
hb/util/options.hh:

    if (!utf8_clusters) {
      /* Reset cluster values to refer to Unicode character index
       * instead of UTF-8 index. */
      unsigned int num_glyphs = hb_buffer_get_length (buffer);
      hb_glyph_info_t *info = hb_buffer_get_glyph_infos (buffer, NULL);
      for (unsigned int i = 0; i < num_glyphs; i++)
      {
            info->cluster = i;
            info++;
      }
    }
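
If you would rather keep the byte-offset clusters and translate them on your
side, the mapping is easy to compute from the string itself.  Below is a
minimal sketch in plain C with no HarfBuzz calls; the helper names
(is_utf8_lead, utf8_char_offsets, utf8_offset_to_char_index) are made up for
illustration and are not part of the HarfBuzz API, and the code assumes the
input is valid UTF-8.

```c
#include <stddef.h>

/* Nonzero if b starts a UTF-8 sequence, i.e. it is not a
 * continuation byte of the form 10xxxxxx. */
static int
is_utf8_lead (unsigned char b)
{
  return (b & 0xC0) != 0x80;
}

/* Store the byte offset at which each character of text starts.
 * These are the values that show up as cluster numbers when the
 * string is added as UTF-8.  Writes at most max entries and returns
 * the total number of characters found. */
static size_t
utf8_char_offsets (const char *text, size_t len,
                   unsigned int *offsets, size_t max)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    if (is_utf8_lead ((unsigned char) text[i]))
    {
      if (n < max)
        offsets[n] = (unsigned int) i;
      n++;
    }
  return n;
}

/* Convert a byte offset (a UTF-8 cluster value) back to a character
 * index by counting the lead bytes that precede it. */
static unsigned int
utf8_offset_to_char_index (const char *text, size_t len,
                           unsigned int byte_offset)
{
  unsigned int index = 0;
  for (size_t i = 0; i < len && i < byte_offset; i++)
    if (is_utf8_lead ((unsigned char) text[i]))
      index++;
  return index;
}
```

For "tér" encoded as the four bytes 74 C3 A9 72, utf8_char_offsets() yields
0, 1, 3, matching the cluster values in your dump, and
utf8_offset_to_char_index() maps cluster 3 back to character index 2.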

behdad