[poppler] pdftotext needs support for surrogates outside the BMP plane

Mon Jun 2 12:59:14 PDT 2008

On 2 Jun 2008, at 7:35 pm, Albert Astals Cid wrote:

> On Monday 02 June 2008, Koji Otani wrote:
>> Thank you.
>>
>> I could view the text file with Unicode Symbol font.
>>
>>> Albert
>>
>>  Could you confirm the patch with this information?
>
> Works here with the Unicode Symbols font too.
>
> One last thing:
>
> +      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
> +       if (i + 1 < uLen) {
> +         Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
> +         i++;
> +         curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
> +       }
> +      } else {
> +       curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
> +      }
>
> What happens if "if (i + 1 < uLen) {" is false? Do we lose a char?
> Or should that never happen, and is it an error?

Yes, it's an error in the UTF-16 data. The most appropriate thing to
do is probably to replace it with U+FFFD.

Actually, the code should also be checking that u[i+1] is a valid
low surrogate in the range 0xdc00..0xdfff; if not, it's not correct
to combine the two code units like that. And to be more robust, it
should also be checking for unpaired (and therefore invalid) low
surrogates.

I don't currently have a working copy to actually test the code on  
this machine, but I think that fragment should be something like this:

+      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* high surrogate */
+       if (i + 1 < uLen && u[i+1] >= 0xdc00 && u[i+1] < 0xe000) { /* check next */
+         /* add 0x10000 rather than OR it in: bit 16 may already be set */
+         Unicode uu = (((u[i] & 0x3ff) << 10) | (u[i+1] & 0x3ff)) + 0x10000;
+         i++;
+         curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+       } else {
+         curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+       }
+      } else if (u[i] >= 0xdc00 && u[i] < 0xe000) { /* invalid low surrogate */
+       curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+      } else {
+       curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      }
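
To check the decoding logic in isolation, away from TextOutputDev, here is a
minimal standalone sketch. The helper name decodeUtf16 and the use of
std::vector are assumptions made for illustration; nothing like this exists
in poppler. It applies the same three-way split (high surrogate, stray low
surrogate, ordinary code unit) and substitutes U+FFFD for malformed input.
The two 10-bit halves are combined and 0x10000 is then added, because OR-ing
0x10000 would lose the carry whenever bit 16 of the shifted high half is
already set (U+20000, for instance).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of poppler): decode UTF-16 code units into
// Unicode code points, replacing malformed surrogates with U+FFFD.
std::vector<std::uint32_t> decodeUtf16(const std::vector<std::uint16_t> &u) {
  const std::uint32_t kReplacement = 0xfffd;
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < u.size(); ++i) {
    const std::uint32_t c = u[i];
    if (c >= 0xd800 && c < 0xdc00) {  // high surrogate
      if (i + 1 < u.size() && u[i + 1] >= 0xdc00 && u[i + 1] < 0xe000) {
        // Valid pair: merge the two 10-bit halves, then add the plane offset.
        const std::uint32_t lo = u[i + 1];
        out.push_back((((c & 0x3ff) << 10) | (lo & 0x3ff)) + 0x10000);
        ++i;  // the low surrogate has been consumed
      } else {
        out.push_back(kReplacement);  // high surrogate with no partner
      }
    } else if (c >= 0xdc00 && c < 0xe000) {
      out.push_back(kReplacement);  // low surrogate with no preceding high
    } else {
      out.push_back(c);  // ordinary BMP code unit
    }
  }
  return out;
}
```

With this sketch, the pair 0xd835 0xdc00 decodes to U+1D400, a lone 0xd800
followed by 0x0041 yields U+FFFD and then U+0041, and a lone 0xdc00 yields
U+FFFD.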

JK