[HarfBuzz] [ARABIC] - 'hb_buffer_len' returning unexpected value after shaping

Laurent CRUAU Laurent.CRUAU at ingenico.com
Wed Oct 31 11:28:11 UTC 2018


Hello there,

I am fairly new to HarfBuzz, but I had been using its Arabic shaping for a while without any trouble until recently.
I am now seeing something odd with a handful of Arabic strings (the vast majority of them cause no problem).

I use HB v1.0.1 on Ubuntu 16 with the regular Arial.ttf from msttcorefonts. I also tried HB v2.0.2 on an embedded target and got the same issue.

Consider the following UTF-16 (little-endian) string:
"\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE" "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE"
Or the equivalent UTF-8:
"\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";

After shaping, hb_buffer_get_length reports 11 glyphs for this string.
The strange thing is that Arabic speakers have told me that, visually, there are still 12 glyphs. I can confirm this myself: if I paste the string into an online UTF-8/UTF-16 decoder, I can move the cursor through 12 characters.
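
The count of 12 input characters can also be checked without an online decoder by counting code points directly. This is a minimal sketch, assuming well-formed UTF-8 input; `utf8_count_code_points` is a hypothetical helper name, not part of HarfBuzz:

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
 * starts a new code point. Assumes the input is well-formed UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Calling it on the UTF-8 string above gives the number of input characters, which is the figure to compare against the glyph count reported after shaping.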

Is some implicit merging of characters happening here, or is there some information I should retrieve from the buffer to match what is seen visually?
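
One thing that could be checked is the `cluster` field of each `hb_glyph_info_t` returned by `hb_buffer_get_glyph_infos`. Each glyph's cluster value points back at the input position it came from, so a glyph standing for two input characters (a ligature, for example, which is only a guess here) would show up as a gap in the cluster values. The sketch below works on a plain array of cluster values so that it compiles without HarfBuzz. `cluster_span` and `count_multi_unit_clusters` are hypothetical helpers, and it assumes the text was added with hb_buffer_add_utf16, so that clusters count UTF-16 code units (with hb_buffer_add_utf8 they would be byte offsets).

```c
#include <stddef.h>
#include <stdint.h>

/* Number of input units covered by the cluster of glyph i.
 * A cluster runs from its own value up to the next larger cluster value
 * present in the buffer, or to the end of the text if there is none.
 * Works for both LTR (ascending) and RTL (descending) glyph order. */
size_t cluster_span(const uint32_t *clusters, size_t glyph_count,
                    size_t text_len, size_t i)
{
    uint32_t start = clusters[i];
    uint32_t end   = (uint32_t)text_len;

    for (size_t j = 0; j < glyph_count; ++j) {
        if (clusters[j] > start && clusters[j] < end)
            end = clusters[j];
    }
    return (size_t)(end - start);
}

/* Count the distinct clusters that cover more than one input unit. */
size_t count_multi_unit_clusters(const uint32_t *clusters, size_t glyph_count,
                                 size_t text_len)
{
    size_t merged = 0;

    for (size_t i = 0; i < glyph_count; ++i) {
        int seen = 0;

        /* Several glyphs may share one cluster; count it only once. */
        for (size_t j = 0; j < i; ++j) {
            if (clusters[j] == clusters[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen && cluster_span(clusters, glyph_count, text_len, i) > 1)
            ++merged;
    }
    return merged;
}
```

As an illustration with made-up values (not actual HarfBuzz output): an RTL result of 11 glyphs with clusters 11, 10, 9, 8, 7, 6, 5, 3, 2, 1, 0 for a 12-unit input would report one merged cluster, covering input positions 3 and 4.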

I should mention that I have tried many HB options to configure shaping (hb_buffer_set_flags, hb_buffer_set_unicode_funcs(...get_default()), etc.), and I hope I have not overlooked something important.

Cheers,
Laurent


Here is my test snippet:

/*----------------------------------------------------------------------------
*
* HarfBuzz arabic shaping text
*
*----------------------------------------------------------------------------*/

#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <stdint.h>   /* uint16_t */
#include <string.h>

#include <harfbuzz/hb.h>
#include <harfbuzz/hb-ft.h>

#define ARIAL_TTF ("/usr/share/fonts/truetype/msttcorefonts/Arial.ttf")

#define UTF16_TEST


static const char utf8_content[] = "\xEF\xBA\x8D\xEF\xBB\x9F\xEF\xBA\xB4\xEF\xBB\xA0\xEF\xBA\x8E\xEF\xBB\xA1\x20\xEF\xBB\x8B\xEF\xBB\xA0\xEF\xBB\xB4\xEF\xBB\x9C\xEF\xBB\xA2\x00";

static const char utf16le_content[] = "\x8D\xFE" "\xDF\xFE" "\xB4\xFE" "\xE0\xFE" "\x8E\xFE" "\xE1\xFE" "\x20\x00" "\xCB\xFE" "\xE0\xFE" "\xF4\xFE" "\xDC\xFE" "\xE2\xFE" "\x0\x0";

int main( int argc, char** argv )
{
/*data*/
    hb_font_t*      font;
    hb_buffer_t*    buffer;
    hb_script_t     script;
    FT_Library      flib;
    FT_Face         face;
    int             found;
    int             ret;


/*code*/
    ret     = -1;
    font    = NULL;
    buffer  = NULL;
    found   = 0;
    script  = HB_SCRIPT_INVALID;

    if( FT_Init_FreeType(&flib) )
    {   printf("unable to initialize freetype library\n");
        goto main_exit;
    }

    if( FT_New_Face(flib, ARIAL_TTF, 0, &face) )
    {   printf("cannot create face\n");
        goto main_exit;
    }

    font = hb_ft_font_create(face, NULL);
    if( !font )
    {   printf("unable to create font\n");
        goto main_exit;
    }

    buffer = hb_buffer_create();
    if( !buffer )
    {   printf("unable to create buffer\n");
        goto main_exit;
    }

    // Assign text segment to buffer and examine its properties
#ifdef UTF16_TEST
    hb_buffer_add_utf16(buffer, (const uint16_t*)utf16le_content, 12, 0, 12);
#else
    hb_buffer_add_utf8(buffer, utf8_content, -1, 0, -1);
#endif
    hb_buffer_guess_segment_properties(buffer);

    // Get script type of text
    script = hb_buffer_get_script(buffer);   //Do not check here but Arabic script IS detected

    hb_buffer_set_direction(buffer, HB_DIRECTION_RTL);
    hb_buffer_set_language(buffer, hb_language_from_string("ar", -1));

    hb_shape(font, buffer, NULL, 0);
    printf("SHAPED !\n");


    /* hb_buffer_get_length returns an unsigned int: the number of glyphs after shaping */
    printf("got %u glyphs as a result\n", hb_buffer_get_length(buffer) );

    ret = 0;

main_exit:
  //test only, free another day
    exit(ret);
}