[HarfBuzz] [ARABIC] - 'hb_buffer_len' returning unexpected value after shaping

Behdad Esfahbod behdad at behdad.org
Thu Nov 1 14:54:49 UTC 2018


That happens when ligatures are applied to text.  In general, there is no
1-to-1 relationship between characters and glyphs.  You can use the cluster
values in hb_glyph_info_t to match clusters of glyphs to clusters of
characters.  That's the most granular mapping you can get.  In this case,
you will see that one glyph (LAM-ALEF) corresponds to two characters (LAM
and ALEF).
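
As a rough sketch of that cluster matching (not HarfBuzz code; map_clusters and cluster_run_t are hypothetical names made up for this illustration), the helper below takes the cluster values of the shaped glyphs and reports how many input code units each run of glyphs covers. With HarfBuzz, the cluster values would come from the cluster field of the hb_glyph_info_t array returned by hb_buffer_get_glyph_infos(). The sketch assumes the default behaviour, where a cluster value is the index of the first input code unit of the cluster (a UTF-16 code unit index for hb_buffer_add_utf16, a byte offset for hb_buffer_add_utf8).

```c
#include <stdio.h>

/* One run of consecutive glyphs that share a cluster value, together with
 * the range of input code units that run corresponds to.
 * (Hypothetical type, for illustration only.) */
typedef struct {
    unsigned int first_glyph;  /* index of the first glyph in the run      */
    unsigned int glyph_count;  /* number of glyphs sharing the cluster     */
    unsigned int first_char;   /* cluster value: first input code unit     */
    unsigned int char_count;   /* number of input code units in the cluster */
} cluster_run_t;

/* Group glyphs by cluster value and compute, for each group, how many input
 * code units it covers.  `clusters` holds one cluster value per glyph, in
 * buffer order (with HarfBuzz: info[i].cluster).  `text_len` is the number of
 * code units that were added to the buffer.  `runs` must have room for
 * `glyph_count` entries.  Returns the number of runs written.
 *
 * The end of a cluster is taken to be the smallest cluster value greater
 * than its own (or `text_len` if there is none), so the function works for
 * both left-to-right and right-to-left buffers. */
static unsigned int
map_clusters(const unsigned int *clusters, unsigned int glyph_count,
             unsigned int text_len, cluster_run_t *runs)
{
    unsigned int run_count = 0;
    unsigned int i = 0;

    while (i < glyph_count) {
        unsigned int j = i;
        unsigned int k;
        unsigned int next = text_len;

        /* Extend the run over all adjacent glyphs with the same cluster. */
        while (j < glyph_count && clusters[j] == clusters[i])
            j++;

        /* Find where the following cluster starts in the input text. */
        for (k = 0; k < glyph_count; k++)
            if (clusters[k] > clusters[i] && clusters[k] < next)
                next = clusters[k];

        runs[run_count].first_glyph = i;
        runs[run_count].glyph_count = j - i;
        runs[run_count].first_char  = clusters[i];
        runs[run_count].char_count  = next - clusters[i];
        run_count++;

        i = j;
    }
    return run_count;
}
```

For example, if the 11 glyphs of a 12-code-unit right-to-left string carried the cluster values 11, 10, 9, 8, 7, 6, 5, 3, 2, 1, 0 (illustrative values chosen to mirror the case described here, not captured from a real run), the run with cluster value 3 would come out as one glyph covering two code units, which is the ligature.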

On Wed, Oct 31, 2018 at 7:28 AM Laurent CRUAU <Laurent.CRUAU at ingenico.com>
wrote:

> Hello there,
>
>
>
> I am fairly new to HarfBuzz, but I had not run into any trouble with
> Arabic shaping until recently.
>
> Now I am seeing something odd with a very small number of Arabic strings
> (the vast majority of them do not cause any problem).
>
>
>
> I use HB v1.0.1 on Ubuntu 16 with the regular Arial TTF from mscorefonts. I
> also tried HB v2.0.2 on an embedded target and got the same issue.
>
>
>
> Consider the following UTF-16 (little-endian) string:
>
> "\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE"
> "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE"
>
> Or the same text in UTF-8:
>
>
> "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";
>
>
>
> After shaping, this string is reported as 11 glyphs (i.e. by
> hb_buffer_get_length).
>
> The strange thing is that some Arabic speakers have told me that
> VISUALLY, there are still 12 glyphs. I can confirm this myself if I paste
> the string into an online UTF-8/UTF-16 decoder: I can move through 12
> characters…
>
>
>
> Is some implicit merging happening here, or is there some information I
> should retrieve somewhere to match the visual result?
>
>
>
> I should mention that I tried many HB options to configure shaping, and I
> hope I have not overlooked something important (hb_buffer_set_flags,
> hb_buffer_set_unicode_funcs(…get_default()), etc.).
>
>
>
> Cheers,
>
> Laurent
>
>
>
>
>
> Here is my test snippet:
>
>
>
>
> /*----------------------------------------------------------------------------
>  *
>  * HarfBuzz arabic shaping text
>  *
>  *----------------------------------------------------------------------------*/
>
> #include <stdio.h>
> #include <stdlib.h>   /* exit() */
> #include <string.h>
> #include <wchar.h>
>
> #include <harfbuzz/hb.h>
> #include <harfbuzz/hb-ft.h>
>
> #define ARIAL_TTF ("/usr/share/fonts/truetype/msttcorefonts/Arial.ttf")
>
> #define UTF16_TEST
>
> static const char utf8_content[] =
> "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";
>
> static const char utf16le_content[] = "\x8D\xFE" "\xDF\xFE" "\xB4\xFE"
> "\xE0\xFE" "\x8E\xFE" "\xE1\xFE" "\x20\x00" "\xCB\xFE" "\xE0\xFE"
> "\xF4\xFE" "\xDC\xFE" "\xE2\xFE" "\x0\x0";
>
> int main( int argc, char** argv )
> {
> /*data*/
>     hb_font_t*      font;
>     hb_buffer_t*    buffer;
>     hb_script_t     script;
>     FT_Library      flib;
>     FT_Face         face;
>     int             found;
>     int             ret;
>
> /*code*/
>     ret     = -1;
>     font    = NULL;
>     buffer  = NULL;
>     found   = 0;
>     script  = HB_SCRIPT_INVALID;
>
>     if( FT_Init_FreeType(&flib) )
>     {   printf("unable to initialize freetype library\n");
>         goto main_exit;
>     }
>
>     if( FT_New_Face(flib, ARIAL_TTF, 0, &face) )
>     {   printf("cannot create face\n");
>         goto main_exit;
>     }
>
>     font = hb_ft_font_create(face, NULL);
>     if( !font )
>     {   printf("unable to create font\n");
>         goto main_exit;
>     }
>
>     buffer = hb_buffer_create();
>     if( !buffer )
>     {   printf("unable to create buffer\n");
>         goto main_exit;
>     }
>
>     // Assign text segment to buffer and examine its properties
> #ifdef UTF16_TEST
>     hb_buffer_add_utf16(buffer, (const uint16_t*)utf16le_content, 12, 0, 12);
> #else
>     hb_buffer_add_utf8(buffer, utf8_content, -1, 0, -1);
> #endif
>     hb_buffer_guess_segment_properties(buffer);
>
>     // Get script type of text
>     // (not checked here, but Arabic script IS detected)
>     script = hb_buffer_get_script(buffer);
>
>     hb_buffer_set_direction(buffer, HB_DIRECTION_RTL);
>     hb_buffer_set_language(buffer, hb_language_from_string("ar", -1));
>
>     hb_shape(font, buffer, NULL, 0);
>     printf("SHAPED !\n");
>
>     // hb_buffer_get_length() returns an unsigned int
>     printf("got %u characters as a result\n", hb_buffer_get_length(buffer));
>
>     ret = 0;
>
> main_exit:
>   //test only, free another day
>     exit(ret);
> }


-- 
behdad
http://behdad.org/