[poppler] pdftotext needs support for surrogates outside the BMP plane
Jonathan Kew
jonathan_kew at sil.org
Mon Jun 2 12:59:14 PDT 2008
On 2 Jun 2008, at 7:35 pm, Albert Astals Cid wrote:
> On Monday 02 June 2008, Koji Otani wrote:
>> Thank you.
>>
>> I could view the text file with Unicode Symbol font.
>>
>>> Albert
>>
>> Could you confirm the patch with this information?
>
> Works here with the Unicode Symbols font too.
>
> One last thing:
>
> + if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
> +   if (i + 1 < uLen) {
> +     Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
> +     i++;
> +     curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
> +   }
> + } else {
> +   curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
> + }
>
> What happens if "if (i + 1 < uLen) {" is false? Do we lose a char?
> Or should that never happen, so it is an error?

Yes, it's an error in the UTF-16 data. The most appropriate thing to
do is probably to replace the unpaired surrogate with U+FFFD
(REPLACEMENT CHARACTER).

Actually, the code should also be checking that u[i+1] is a valid
low surrogate in the range 0xdc00..0xdfff; if it is not, it's not
correct to combine the two code units like that. And to be more
robust, it should also check for unpaired low surrogates, which are
equally invalid.

I don't currently have a working copy to actually test the code on
this machine, but I think that fragment should be something like this:

+ if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* high surrogate */
+   if (i + 1 < uLen && u[i+1] >= 0xdc00 && u[i+1] < 0xe000) { /* next unit is a low surrogate */
+     /* add 0x10000 rather than OR it in: the shifted high bits can already occupy bit 16 */
+     Unicode uu = (((u[i] & 0x3ff) << 10) | (u[i+1] & 0x3ff)) + 0x10000;
+     i++;
+     curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+   } else { /* unpaired high surrogate */
+     curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+   }
+ } else if (u[i] >= 0xdc00 && u[i] < 0xe000) { /* unpaired low surrogate */
+   curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+ } else {
+   curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+ }
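
For anyone who wants to try the surrogate handling outside of pdftotext, the same three cases (valid pair, unpaired high surrogate, unpaired low surrogate) can be sketched as a standalone function. This is only an illustration, not poppler code: the name decodeUtf16 and its vector-based signature are hypothetical, and it assumes the offset 0x10000 is added to the combined 20 bits, as the UTF-16 definition requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of poppler): decode UTF-16 code units into
// Unicode code points, substituting U+FFFD for any unpaired surrogate.
std::vector<std::uint32_t> decodeUtf16(const std::vector<std::uint16_t> &u) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < u.size(); ++i) {
        const std::uint32_t cu = u[i];
        if (cu >= 0xd800 && cu < 0xdc00) {  // high surrogate
            if (i + 1 < u.size() && u[i + 1] >= 0xdc00 && u[i + 1] < 0xe000) {
                const std::uint32_t lo = u[i + 1];
                // Combine the two 10-bit halves, then add the 0x10000 offset.
                out.push_back((((cu & 0x3ff) << 10) | (lo & 0x3ff)) + 0x10000);
                ++i;  // consume the low surrogate as well
            } else {
                out.push_back(0xfffd);  // unpaired high surrogate
            }
        } else if (cu >= 0xdc00 && cu < 0xe000) {
            out.push_back(0xfffd);  // unpaired low surrogate
        } else {
            out.push_back(cu);  // ordinary BMP code point
        }
    }
    return out;
}
```

With this sketch, the pair 0xD840 0xDC00 decodes to U+20000, which is a case where OR-ing in 0x10000 instead of adding it would give the wrong code point.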
JK