[HarfBuzz] Trouble with clusters and accented latin characters
Behdad Esfahbod
behdad at behdad.org
Sun Oct 14 10:38:24 PDT 2012
On 12-10-14 12:31 PM, Lóránt Pintér wrote:
> Hi,
>
> I'm trying to shape the word "tér" with HarfBuzz, and this is what I get back:
>
> hb_buffer_get_glyph_infos() after calling hb_buffer_add_utf8():
>
> Char #0: { codepoint: 116, mask: 1, cluster: 0, var1: 0, var2: 0 }
> Char #1: { codepoint: 233, mask: 1, cluster: 1, var1: 0, var2: 0 }
> Char #2: { codepoint: 114, mask: 1, cluster: 3, var1: 0, var2: 0 }
>
> …and after calling hb_shape():
>
> Glyph #0: { codepoint: 86, mask: 1, cluster: 0, var1: 2, var2: 5 }
> Glyph #1: { codepoint: 156, mask: 1, cluster: 1, var1: 2, var2: 5 }
> Glyph #2: { codepoint: 84, mask: 1, cluster: 3, var1: 2, var2: 5 }
>
> I believed up to now that each cluster corresponded to a character in the
> original string. Why is the letter "é" turned into two clusters here?
When you use hb_buffer_add_utf8(), cluster values are set to UTF-8 byte
offsets into the original string. The precomposed letter "é" takes two bytes
in UTF-8, which is why "r" gets cluster 3 rather than 2. There are still only
three clusters, one per character; the numbering just skips a byte. If you
prefer plain character indices instead, loop over the buffer and reset the
cluster values before calling hb_shape(). For example, this is from
hb/util/options.hh:
  if (!utf8_clusters) {
    /* Reset cluster values to refer to Unicode character index
     * instead of UTF-8 index. */
    unsigned int num_glyphs = hb_buffer_get_length (buffer);
    hb_glyph_info_t *info = hb_buffer_get_glyph_infos (buffer, NULL);
    for (unsigned int i = 0; i < num_glyphs; i++)
    {
      info->cluster = i;
      info++;
    }
  }
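
To make the numbering concrete, here is a minimal sketch in plain C,
independent of HarfBuzz, of where the values 0, 1, 3 come from. The helper
utf8_start_offsets is a hypothetical name, not a HarfBuzz function. It assumes
well-formed UTF-8 and simply records the byte offset at which each code point
starts, which is the kind of value described above for clusters set by
add_utf8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of HarfBuzz): record the byte offset at
 * which each code point of a UTF-8 string starts.
 * Assumes the input is well-formed UTF-8.
 * Returns the number of code points found; stores at most max_offsets
 * entries in offsets[]. */
static size_t
utf8_start_offsets (const char *text, size_t len,
                    unsigned int *offsets, size_t max_offsets)
{
  size_t count = 0;

  assert (text != NULL && offsets != NULL);

  for (size_t i = 0; i < len; i++)
  {
    /* Continuation bytes have the bit pattern 10xxxxxx; any other byte
     * starts a new code point. */
    if (((unsigned char) text[i] & 0xC0) != 0x80)
    {
      if (count < max_offsets)
        offsets[count] = (unsigned int) i;
      count++;
    }
  }
  return count;
}
```

For the bytes of "tér" (74 C3 A9 72 in hex) this should report three code
points starting at offsets 0, 1 and 3. The position of each entry in the
array is the plain character index, so the same table can be used to map a
byte-offset cluster back to a character index without touching the buffer.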
behdad