[HarfBuzz] [ARABIC] - 'hb_buffer_len' returning unexpected value after shaping

Wed Oct 31 12:03:53 UTC 2018

On Wed, Oct 31, 2018 at 11:28:11AM +0000, Laurent CRUAU wrote:
> Hello there,
> 
> I am pretty new to harfbuzz but anyway I had not been into trouble for long using arabic shaping until recently.
> And now I am submitted something weird with very few Arabic strings (the vast majority of them do not cause any problem).
> 
> I use HB v1.0.1 on Ubuntu 16, using the regular ArialTTF mscorefont. I also tried HB v2.0.2. on an embedded target and got the same issue.
> 
> Consider the following utf16 string:
> "\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE" "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE"
> Or the following UTF8:
> "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";

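How did you get the string? It uses Arabic Presentation Forms
(U+FE70..U+FEFF), and though that is technically valid Unicode text, it
is not usually the kind of input you should pass to HarfBuzz. The shaper
expects the regular Arabic letters (U+0600..U+06FF) and selects the
contextual forms itself.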
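
To check whether a string contains presentation forms before handing it
to the shaper, something along these lines could be used. This is only a
sketch: the helper names (utf8_decode, is_arabic_presentation_form,
count_presentation_forms) are made up for illustration and are not part
of HarfBuzz, and the decoder assumes well-formed, NUL-terminated UTF-8.
The ranges tested are the Unicode blocks Arabic Presentation Forms-A
(U+FB50..U+FDFF) and Forms-B (U+FE70..U+FEFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s into *cp and return the number
 * of bytes consumed. Assumption: the input is well-formed UTF-8; no
 * validation is done here. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* True if cp lies in Arabic Presentation Forms-A or Forms-B. */
static int is_arabic_presentation_form(uint32_t cp)
{
    return (cp >= 0xFB50 && cp <= 0xFDFF) || (cp >= 0xFE70 && cp <= 0xFEFF);
}

/* Count the code points of a NUL-terminated UTF-8 string that fall in
 * either presentation-forms block. */
static size_t count_presentation_forms(const char *utf8)
{
    assert(utf8 != NULL);
    const unsigned char *s = (const unsigned char *)utf8;
    size_t count = 0;
    while (*s != 0) {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        if (is_arabic_presentation_form(cp))
            count++;
    }
    return count;
}
```

A non-zero result suggests the text was pre-shaped somewhere upstream,
and that it may be worth finding the original, unshaped string instead.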

> After shaping has been performed, the following string is counted for 11 glyphs (i.e. w/ hb_buffer_len).

The number of output glyphs does not have to be the same as the number
of input characters. If ligatures are formed, there can be fewer glyphs
than characters, and if any characters are decomposed, there can be
more. In general your code should not make any assumptions about the
number of glyphs based on the number of input characters.

To match output glyphs with input characters, you should use the cluster
field of each hb_glyph_info_t in the shaped buffer.
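
As a rough illustration of the cluster lookup, here is a sketch that
does not link against HarfBuzz. The glyph_info struct is a stand-in that
mirrors only the codepoint and cluster members of hb_glyph_info_t, and
the two helpers are hypothetical names. In real code the array would
come from hb_buffer_get_glyph_infos(), and with hb_buffer_add_utf8() the
cluster values are byte offsets into the UTF-8 text, not character
indices. The sketch also assumes the default behaviour where a ligature
carries the smallest input index of its components.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the two hb_glyph_info_t members used here. */
typedef struct {
    uint32_t codepoint; /* glyph index after shaping */
    uint32_t cluster;   /* index of the input item this glyph came from */
} glyph_info;

/* Return the cluster value that owns input position `index`: the largest
 * cluster value that is <= index. Scanning the whole array means this
 * works for both left-to-right and right-to-left glyph order.
 * Returns UINT32_MAX if no cluster starts at or before `index`. */
static uint32_t cluster_of_index(const glyph_info *infos, size_t n,
                                 uint32_t index)
{
    assert(infos != NULL);
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = infos[i].cluster;
        if (c <= index && (best == UINT32_MAX || c > best))
            best = c;
    }
    return best;
}

/* Count the glyphs that carry a given cluster value. A result above 1
 * means one input position produced several glyphs. */
static size_t glyphs_in_cluster(const glyph_info *infos, size_t n,
                                uint32_t cluster)
{
    assert(infos != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (infos[i].cluster == cluster)
            count++;
    return count;
}
```

For example, if 12 input characters were shaped into 11 glyphs because
the characters at positions 3 and 4 formed a ligature, no glyph would
carry cluster 4, and cluster_of_index() for position 4 would report
cluster 3. Those values are invented for illustration, not actual
shaper output for the string above.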

Regards,
Khaled