[HarfBuzz] What is wrong with unicode in harfbuzz?

Fri Jun 17 02:44:45 UTC 2016

On Thu, Jun 16, 2016 at 09:35:03PM -0400, Kelvin Ma wrote:
> When I run a simple harfbuzz shaping like
> 
> > string = 'In begíffi our '
> > utfstring = string.encode('utf-8')
> >
> > buf = hb.buffer_create()
> > hb.buffer_add_utf8(buf, utfstring, 0, -1)
> > hb.buffer_guess_segment_properties(buf)
> >
> > hb.shape(font, buf, [])
> > infos = hb.buffer_get_glyph_infos(buf)
> > positions = hb.buffer_get_glyph_positions(buf)
> >
> 
> I get
> 
> len(string) = 15
> len(infos) = 13
> len(positions) = 13
> 
> which makes sense, three glyphs became one so 15 characters makes 13
> glyphs. But the cluster values are wrong because they don’t line up with
> the character indexes any more (because of the accented character).
> 
> But then when I change it to utf-16
> 
> > string = 'In begíffi our '
> > utfstring = string.encode('utf-16')

You need a list of UTF-16 code units here, but string.encode('utf-16')
just gives you an array of UTF-16 bytes (with a BOM prepended). You need
something like:

utfstring = [int.from_bytes(c.encode("utf-16be"), byteorder='big') for c in string]

(This does not handle non-BMP characters, which are encoded as two
UTF-16 code units each, but you get the idea.)
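
If non-BMP characters matter, one way to get real code units is to
encode the whole string once and split the bytes into 16-bit values.
A minimal sketch in plain Python; the helper name utf16_code_units is
made up for illustration and is not part of any HarfBuzz binding:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    # "utf-16-le" fixes the byte order and, unlike "utf-16", adds no BOM.
    data = text.encode("utf-16-le")
    # Each code unit is two bytes; a non-BMP character yields a surrogate pair.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


units = utf16_code_units("In begíffi our ")
# All characters here are in the BMP, so there is one code unit per character.
print(len(units))
```

With a list built this way, an index into the list only equals a
character index as long as the text contains no surrogate pairs.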

> > hb.buffer_add_utf16(buf, utfstring, 0, -1)

And pass the length of the list here (or add a null character at the
end of the list).

> And when I change it to utf-32, which this post
> <http://comments.gmane.org/gmane.comp.freedesktop.harfbuzz/1836> says
> should make it give character counts, but

Same as above.
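
For UTF-32 the list is simply the code points, which Python 3 gives
you directly with ord(). The sketch below is plain Python with no
HarfBuzz calls; both helper names are hypothetical. The second helper
assumes that cluster values from a UTF-8 buffer are byte offsets into
the encoded string, and that the offset falls on a character boundary:

```python
def utf32_code_points(text):
    """Return one integer code point per character of text."""
    return [ord(c) for c in text]


def byte_offset_to_char_index(utf8_bytes, offset):
    """Map a byte offset in UTF-8 data to a character index.

    Assumption: offset lies on a character boundary; otherwise
    decode() raises UnicodeDecodeError.
    """
    return len(utf8_bytes[:offset].decode("utf-8"))


string = "In begíffi our "
codepoints = utf32_code_points(string)
# One entry per character, so list indexes match string indexes.
print(len(codepoints) == len(string))
```

How the list and its length are passed to buffer_add_utf32 depends on
the binding you are using.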

Regards,
Khaled