[FriBidi] Invalid UTF-8 for Arabic
Behnam Esfahbod ZWNJ
behnam at zwnj.org
Sat Mar 7 01:03:33 PST 2009
Dear Yoann,
I haven't used fribidi on Windows, but I just ran the fribidi
executable on Linux with your input. The result differs a lot:
$ hexdump -C arabic.output-unix
00000000 d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9 |...... .........|
00000010 84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9 |. .... .........|
00000020 84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20 |. ........ .... |
00000030 d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9 |............ ...|
00000040 88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87 |......... ....|
0000004e
compared to your output:
$ hexdump -C arabic.output
00000000 d8 9f ef bb bb ef bb bf 20 ef ba 8d ef ba ab ef |........ .......|
00000010 ba 8e ef bb a4 ef bb 9f 20 ef bb ad ef ba 83 20 |........ ...... |
00000020 ef ba 8d ef ba ab ef ba 8e ef bb a4 ef bb 9f 20 |............... |
00000030 d8 9f ef bb a1 ef bb ae ef bb b3 20 ef bb 9e ef |........... ....|
00000040 bb 9b 20 ef ba ad ef bb ae ef bb 84 ef bb 94 ef |.. .............|
00000050 bb 9f ef ba 8d 20 ef bb 9d ef bb ad ef ba 8e ef |..... ..........|
00000060 bb a8 ef ba 98 ef ba 97 20 ef bb 9e ef bb ab |........ ......|
0000006f
I'm almost sure the Unix output contains no invalid UTF-8 sequences, but the
Windows output looks badly wrong.
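One way to check the two dumps above is to decode them strictly and look at the resulting code points. The following Python snippet is only a sketch: the byte strings are transcribed from the hexdump listings in this message (assumed to be copied correctly), the variable and function names are made up for illustration, and it says nothing about what fribidi does internally.

```python
# Bytes transcribed from the two hexdump listings above (assumption: copied correctly).
unix_out = bytes.fromhex(
    "d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9 "
    "84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9 "
    "84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20 "
    "d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9 "
    "88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87"
)
windows_out = bytes.fromhex(
    "d8 9f ef bb bb ef bb bf 20 ef ba 8d ef ba ab ef "
    "ba 8e ef bb a4 ef bb 9f 20 ef bb ad ef ba 83 20 "
    "ef ba 8d ef ba ab ef ba 8e ef bb a4 ef bb 9f 20 "
    "d8 9f ef bb a1 ef bb ae ef bb b3 20 ef bb 9e ef "
    "bb 9b 20 ef ba ad ef bb ae ef bb 84 ef bb 94 ef "
    "bb 9f ef ba 8d 20 ef bb 9d ef bb ad ef ba 8e ef "
    "bb a8 ef ba 98 ef ba 97 20 ef bb 9e ef bb ab"
)


def code_points(data):
    """Decode strictly as UTF-8; raises UnicodeDecodeError on a bad sequence."""
    return [ord(ch) for ch in data.decode("utf-8", errors="strict")]


unix_cps = code_points(unix_out)
windows_cps = code_points(windows_out)

# Byte lengths match the final offsets in the dumps (0x4e and 0x6f).
print(len(unix_out), len(windows_out))  # 78 111

# First code points of the Windows output.
print(" ".join(f"U+{cp:04X}" for cp in windows_cps[:4]))
# U+061F U+FEFB U+FEFF U+0020

# Byte offset of the first EF BB BF sequence (U+FEFF) in the Windows output.
bom_offset = windows_out.find(b"\xef\xbb\xbf")

# Code points that fall in the Arabic Presentation Forms-B block.
shaped = [cp for cp in windows_cps if 0xFE70 <= cp <= 0xFEFF]
```

As transcribed here, both dumps decode without error. The difference is in the code points: the Linux output stays in the basic Arabic block (U+06xx), while the Windows output consists of Arabic Presentation Forms-B code points, with U+FEFF at byte offsets 5 to 7, right after the U+FEFB lam-alef ligature.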
Just for reference, the input for both outputs was this:
$ hexdump -C arabic.input
00000000 d9 87 d9 84 20 d8 aa d8 aa d9 86 d8 a7 d9 88 d9 |.... ...........|
00000010 84 20 d8 a7 d9 84 d9 81 d8 b7 d9 88 d8 b1 20 d9 |. ............ .|
00000020 83 d9 84 20 d9 8a d9 88 d9 85 d8 9f 20 d9 84 d9 |... ........ ...|
00000030 85 d8 a7 d8 b0 d8 a7 20 d8 a3 d9 88 20 d9 84 d9 |....... .... ...|
00000040 85 d8 a7 d8 b0 d8 a7 20 d9 84 d8 a7 d8 9f |....... ......|
0000004e
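For this particular input, which contains only Arabic letters, spaces and the Arabic question mark, the Linux output shown earlier appears to be the input with its code points in reverse order and nothing else changed. A small Python check of that observation, assuming the bytes below are transcribed correctly from the two hexdump listings (the variable names are illustrative only):

```python
# Input bytes, transcribed from the arabic.input dump (assumption: copied correctly).
logical = bytes.fromhex(
    "d9 87 d9 84 20 d8 aa d8 aa d9 86 d8 a7 d9 88 d9 "
    "84 20 d8 a7 d9 84 d9 81 d8 b7 d9 88 d8 b1 20 d9 "
    "83 d9 84 20 d9 8a d9 88 d9 85 d8 9f 20 d9 84 d9 "
    "85 d8 a7 d8 b0 d8 a7 20 d8 a3 d9 88 20 d9 84 d9 "
    "85 d8 a7 d8 b0 d8 a7 20 d9 84 d8 a7 d8 9f"
)

# Linux output bytes, transcribed from the arabic.output-unix dump.
unix_out = bytes.fromhex(
    "d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9 "
    "84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9 "
    "84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20 "
    "d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9 "
    "88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87"
)

# Reverse by code point (not by byte), then re-encode as UTF-8.
reversed_input = logical.decode("utf-8")[::-1].encode("utf-8")

print(reversed_input == unix_out)  # True
```

This holds only because the whole line is right-to-left. Text mixed with Latin letters or digits would not come out as a plain reversal.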
Hope it helps.
-Behnam ZWNJ
On Sat, Mar 7, 2009 at 2:25 AM, Yoann Roman <yroman at altalang.com> wrote:
> Behdad Esfahbod wrote:
>>> I'm using the trunk fribidi2 code, compiled with VS2003 on Windows
>>> XP, from Python with the Pyfribidi extension, also compiled on VS
>>> 2003.
>>
>> The first step to debug this is to make sure PyFriBidi is not the
>> culprit. That is, can you reproduce the bug using C? If yes, please
>> send the code here.
>
> I'm no C expert, so I took a slightly different approach to pull
> Pyfribidi out of the equation. I compiled fribidi.exe and used a hex
> editor to check its output. It looks like the lost final byte may be a
> Pyfribidi problem. This test did bring up another bug, though.
>
> Attached is a zip with:
>
> - arabic.input: the Arabic string straight out of Python. This will
> show up correctly in anything with bidi support (e.g., Notepad on
> Windows XP with Arabic support installed). There is no BOM.
>
> - arabic.output: output from running bin\fribidi.exe --nopad
> arabic.input. No Python involved here.
>
> - arabic-correct.png: a correct Word visual representation
>
> - arabic-incorrect.png: what I get using arabic.output
>
> If you open arabic.output in a hex editor, you'll see that bytes 5
> through 7 contain the UTF-8 BOM sequence. It looks like no characters
> are missing, though.
>
> Is this enough info to track this new issue down?
>
> Thanks,
>
> --
> Yoann Roman
>
--
' بهنام اسفهبد
' Behnam Esfahbod
'
* .. http://behnam.esfahbod.info
* ` *
* o * http://zwnj.org