[FriBidi] Invalid UTF-8 for Arabic

Behnam Esfahbod ZWNJ behnam at zwnj.org
Sat Mar 7 01:03:33 PST 2009


Dear Yoann,

I haven't used fribidi on Windows, but I just ran the fribidi
executable on Linux with your input.  The result differs a lot:

$ hexdump -C arabic.output-unix
00000000  d8 9f d8 a7 d9 84 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |...... .........|
00000010  84 20 d9 88 d8 a3 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |. .... .........|
00000020  84 20 d8 9f d9 85 d9 88  d9 8a 20 d9 84 d9 83 20  |. ........ .... |
00000030  d8 b1 d9 88 d8 b7 d9 81  d9 84 d8 a7 20 d9 84 d9  |............ ...|
00000040  88 d8 a7 d9 86 d8 aa d8  aa 20 d9 84 d9 87        |......... ....|
0000004e

compared to your output:

$ hexdump -C arabic.output
00000000  d8 9f ef bb bb ef bb bf  20 ef ba 8d ef ba ab ef  |........ .......|
00000010  ba 8e ef bb a4 ef bb 9f  20 ef bb ad ef ba 83 20  |........ ...... |
00000020  ef ba 8d ef ba ab ef ba  8e ef bb a4 ef bb 9f 20  |............... |
00000030  d8 9f ef bb a1 ef bb ae  ef bb b3 20 ef bb 9e ef  |........... ....|
00000040  bb 9b 20 ef ba ad ef bb  ae ef bb 84 ef bb 94 ef  |.. .............|
00000050  bb 9f ef ba 8d 20 ef bb  9d ef bb ad ef ba 8e ef  |..... ..........|
00000060  bb a8 ef ba 98 ef ba 97  20 ef bb 9e ef bb ab     |........ ......|
0000006f

I'm almost sure the Unix output contains no invalid UTF-8 sequences,
but the Windows output looks quite wrong.
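
One way to check that impression is a strict decode. The following is
only a sketch in Python, with the bytes transcribed by hand from the
two hexdumps above; the helper name summarize() is made up for
illustration. It decodes each dump with errors="strict", counts how
many code points fall in the basic Arabic block versus Arabic
Presentation Forms-B, and looks for the byte sequence EF BB BF.

```python
# Bytes transcribed from the two hexdumps above
# (offsets and the ASCII column dropped).
UNIX_HEX = """
d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9
84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9
84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20
d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9
88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87
"""
WINDOWS_HEX = """
d8 9f ef bb bb ef bb bf 20 ef ba 8d ef ba ab ef
ba 8e ef bb a4 ef bb 9f 20 ef bb ad ef ba 83 20
ef ba 8d ef ba ab ef ba 8e ef bb a4 ef bb 9f 20
d8 9f ef bb a1 ef bb ae ef bb b3 20 ef bb 9e ef
bb 9b 20 ef ba ad ef bb ae ef bb 84 ef bb 94 ef
bb 9f ef ba 8d 20 ef bb 9d ef bb ad ef ba 8e ef
bb a8 ef ba 98 ef ba 97 20 ef bb 9e ef bb ab
"""


def summarize(hex_text):
    """Strictly decode a hex transcription and count code points per block."""
    data = bytes.fromhex("".join(hex_text.split()))
    # errors="strict" raises UnicodeDecodeError on any malformed sequence.
    text = data.decode("utf-8", errors="strict")
    return {
        "bytes": len(data),
        "code_points": len(text),
        # Basic Arabic block, U+0600..U+06FF
        "arabic": sum(0x0600 <= ord(c) <= 0x06FF for c in text),
        # Arabic Presentation Forms-B, U+FE70..U+FEFF (includes U+FEFF)
        "presentation_forms_b": sum(0xFE70 <= ord(c) <= 0xFEFF for c in text),
        # Byte offset of the first EF BB BF (U+FEFF), or -1 if absent
        "feff_offset": data.find(b"\xef\xbb\xbf"),
    }


for name, dump in (("unix", UNIX_HEX), ("windows", WINDOWS_HEX)):
    print(name, summarize(dump))
```

With the bytes transcribed as above, both dumps decode without
raising.  The Unix one stays entirely in the basic Arabic block, while
the Windows one is almost all presentation forms, with EF BB BF
(U+FEFF) starting at byte offset 5.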

Just for reference, the input for both outputs was this:

$ hexdump -C arabic.input
00000000  d9 87 d9 84 20 d8 aa d8  aa d9 86 d8 a7 d9 88 d9  |.... ...........|
00000010  84 20 d8 a7 d9 84 d9 81  d8 b7 d9 88 d8 b1 20 d9  |. ............ .|
00000020  83 d9 84 20 d9 8a d9 88  d9 85 d8 9f 20 d9 84 d9  |... ........ ...|
00000030  85 d8 a7 d8 b0 d8 a7 20  d8 a3 d9 88 20 d9 84 d9  |....... .... ...|
00000040  85 d8 a7 d8 b0 d8 a7 20  d9 84 d8 a7 d8 9f        |....... ......|
0000004e
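
A quick sanity check on the Unix output: this input is entirely
right-to-left (no digits, Latin text, or combining marks), so the
visual order should simply be the logical order reversed, code point by
code point.  A rough Python sketch, again with bytes transcribed by hand
from the dumps above and with illustrative names (decode_dump is not
part of fribidi or PyFriBidi):

```python
# Logical-order input and Unix visual-order output, from the hexdumps above.
INPUT_HEX = """
d9 87 d9 84 20 d8 aa d8 aa d9 86 d8 a7 d9 88 d9
84 20 d8 a7 d9 84 d9 81 d8 b7 d9 88 d8 b1 20 d9
83 d9 84 20 d9 8a d9 88 d9 85 d8 9f 20 d9 84 d9
85 d8 a7 d8 b0 d8 a7 20 d8 a3 d9 88 20 d9 84 d9
85 d8 a7 d8 b0 d8 a7 20 d9 84 d8 a7 d8 9f
"""
UNIX_HEX = """
d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9
84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9
84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20
d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9
88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87
"""


def decode_dump(hex_text):
    """Turn a whitespace-separated hex transcription into a str."""
    return bytes.fromhex("".join(hex_text.split())).decode("utf-8")


logical = decode_dump(INPUT_HEX)   # order as typed / stored
visual = decode_dump(UNIX_HEX)     # order as produced by fribidi on Linux

# Reverse by code point, not by byte, so multi-byte sequences stay intact.
is_reversal = visual == logical[::-1]
print(len(logical), len(visual), is_reversal)
```

The comparison is done on decoded strings rather than raw bytes, since
reversing the bytes directly would scramble every two-byte sequence.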

Hope it helps.

-Behnam ZWNJ


On Sat, Mar 7, 2009 at 2:25 AM, Yoann Roman <yroman at altalang.com> wrote:
> Behdad Esfahbod wrote:
>>> I'm using the trunk fribidi2 code, compiled with VS2003 on Windows
>>> XP, from Python with the Pyfribidi extension, also compiled on VS
>>> 2003.
>>
>> The first step to debug this is to make sure PyFriBidi is not the
>> culprit. That is, can you reproduce the bug using C?  If yes, please
>> send the code here.
>
> I'm no C expert, so I took a slightly different approach to pull
> Pyfribidi out of the equation. I compiled fribidi.exe and used a Hex
> editor to check its output. Looks like the lost final byte may be a
> Pyfribidi problem. This test did bring up another bug, though.
>
> Attached is a zip with:
>
>  - arabic.input: the Arabic string straight out of Python. This will
>    show up correctly in anything with bidi support (e.g., Notepad on
>    Windows XP with Arabic support installed). There is no BOM.
>
>  - arabic.output: output from running bin\fribidi.exe --nopad
>    arabic.input. No Python involved here.
>
>  - arabic-correct.png: a correct Word visual representation
>
>  - arabic-incorrect.png: what I get using arabic.output
>
> If you open arabic.output in a Hex editor, you'll see that bytes 5
> through 7 contain the UTF-8 BOM sequence. It looks like no characters
> are missing, though.
>
> Is this enough info to track this new issue down?
>
> Thanks,
>
> --
> Yoann Roman
>
> _______________________________________________
> fribidi mailing list
> fribidi at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/fribidi



-- 
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '
  *  ..   http://behnam.esfahbod.info
 *  `  *
  * o *   http://zwnj.org

