[HarfBuzz] What is wrong with unicode in harfbuzz?

Fri Jun 17 01:35:03 UTC 2016

When I run a simple harfbuzz shaping like

    string = 'In begíffi our '
    utfstring = string.encode('utf-8')

    buf = hb.buffer_create()
    hb.buffer_add_utf8(buf, utfstring, 0, -1)
    hb.buffer_guess_segment_properties(buf)

    hb.shape(font, buf, [])
    infos = hb.buffer_get_glyph_infos(buf)
    positions = hb.buffer_get_glyph_positions(buf)

I get

len(string) = 15
len(infos) = 13
len(positions) = 13

which makes sense: three glyphs became one, so 15 characters make 13
glyphs. But the cluster values are wrong, because they no longer line up
with the character indexes (because of the accented character).
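
For reference, as far as I understand the HarfBuzz documentation, cluster
values after hb_buffer_add_utf8 are offsets into the UTF-8 input, that is
byte indices rather than character indices, so a two-byte character such as
'í' shifts everything after it by one. A minimal sketch of translating such
offsets back to character indices, using plain Python and no HarfBuzz calls
(the helper name byte_to_char_index is hypothetical, not part of any binding):

```python
def byte_to_char_index(text):
    """Map the UTF-8 byte offset where each character starts to its character index."""
    mapping = {}
    offset = 0
    for index, char in enumerate(text):
        mapping[offset] = index
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(char.encode('utf-8'))
    return mapping

string = 'In begíffi our '
mapping = byte_to_char_index(string)

# 'í' starts at byte 6 and occupies two bytes, so the first 'f' starts at byte 8.
print(mapping[6], mapping[8])
```

A cluster value c taken from infos could then be looked up as mapping[c],
assuming the binding exposes the cluster as a plain integer.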

But then when I change it to utf-16

    string = 'In begíffi our '
    utfstring = string.encode('utf-16')

    buf = hb.buffer_create()
    hb.buffer_add_utf16(buf, utfstring, 0, -1)
    hb.buffer_guess_segment_properties(buf)

    hb.shape(font, buf, [])
    infos = hb.buffer_get_glyph_infos(buf)
    positions = hb.buffer_get_glyph_positions(buf)

I get

len(string) = 15
len(infos) = 32
len(positions) = 32

And when I change it to utf-32, which this post
<http://comments.gmane.org/gmane.comp.freedesktop.harfbuzz/1836> says
should make it give character counts, the code

    string = 'In begíffi our '
    utfstring = string.encode('utf-32')

    buf = hb.buffer_create()
    hb.buffer_add_utf32(buf, utfstring, 0, -1)
    hb.buffer_guess_segment_properties(buf)

    hb.shape(font, buf, [])
    infos = hb.buffer_get_glyph_infos(buf)
    positions = hb.buffer_get_glyph_positions(buf)

makes

len(string) = 15
len(infos) = 64
len(positions) = 64

What’s going on here? Why does harfbuzz suddenly output 64 glyphs? I
thought glyphs weren’t supposed to depend on the original encoding.
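
One observation that may help narrow this down, offered as a guess rather
than a diagnosis: the counts 32 and 64 equal the byte lengths of the encoded
strings, including the byte order mark (BOM) that Python's 'utf-16' and
'utf-32' codecs prepend. That would be consistent with every byte being
treated as a separate item, though nothing above confirms how
hb.buffer_add_utf16 and hb.buffer_add_utf32 in this binding interpret their
argument. The check below uses only the standard library:

```python
string = 'In begíffi our '

# The generic codecs prepend a BOM (2 bytes for UTF-16, 4 bytes for UTF-32).
utf16_bytes = string.encode('utf-16')
utf32_bytes = string.encode('utf-32')
print(len(utf16_bytes), len(utf32_bytes))

# The explicit-endian codecs omit the BOM; dividing the byte length by the
# code unit size gives the number of code units instead of bytes.
utf16_units = len(string.encode('utf-16-le')) // 2
utf32_units = len(string.encode('utf-32-le')) // 4
print(utf16_units, utf32_units)

# One integer per character, which is what a UTF-32 code unit array holds.
codepoints = [ord(ch) for ch in string]
print(len(codepoints))
```

If the byte reading is right, the input would need to reach HarfBuzz as an
array of 16-bit or 32-bit code units without a BOM, but how to pass that
depends on the binding in use.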