[poppler] pdftotext needs support for surrogates outside the BMP plane

Ross Moore ross at ics.mq.edu.au
Mon Jun 2 14:38:37 PDT 2008


Hi Jonathan,

Great to see you on this list!

On 03/06/2008, at 7:07 AM, Jonathan Kew wrote:
> On 2 Jun 2008, at 9:51 pm, Ross Moore wrote:

>> So yes, throwing an error is recommended; but, IMHO dropping the
>> character and continuing as far as possible would be a friendly thing
>> to do, as part of how the error is presented.
>
> Simply dropping it is a bad thing; replacing it with U+FFFD  
> REPLACEMENT CHARACTER is better.

OK, fine by me.


>> Doesn't this translate into the following ?
>>
>>       Unicode uu = ((u[i] & 0x7ff) << 10 ) + (u[i+1] & 0x3ff) + 0x10000;
>
> The expression (u[i] & 0x3ff) in the original is fine; the only  
> problem is that it should ADD the 0x10000, not OR it with the rest  
> of the value. (The 0x400 bit can never be set on a high surrogate;  
> if it were, it would have been outside the range D800..DBFF.)

Aaah; of course. And this fact makes it easy to test the validity
of the low surrogate, which must have that 0x400 bit set (its range
is DC00..DFFF).
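
Putting the points above together, a minimal sketch might look like the
following. It is only an illustration, not poppler's actual code: the
names CodePoint, kReplacementChar, isHighSurrogate, isLowSurrogate,
combineSurrogates and decodeUtf16 are all made up here, and the input is
assumed to be a plain array of 16-bit UTF-16 code units.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// All names below are hypothetical, chosen for illustration only.
typedef std::uint32_t CodePoint;

static const CodePoint kReplacementChar = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// High surrogates are D800..DBFF: the 0x400 bit is always clear.
inline bool isHighSurrogate(std::uint16_t c) { return (c & 0xFC00) == 0xD800; }

// Low surrogates are DC00..DFFF: same top bits, but the 0x400 bit is set.
inline bool isLowSurrogate(std::uint16_t c) { return (c & 0xFC00) == 0xDC00; }

// Combine a valid pair. The 0x10000 offset is ADDED, not ORed, because
// ((hi & 0x3ff) << 10) can itself reach bit 16 and above.
inline CodePoint combineSurrogates(std::uint16_t hi, std::uint16_t lo) {
    return ((static_cast<CodePoint>(hi) & 0x3ff) << 10) + (lo & 0x3ff) + 0x10000;
}

// Decode UTF-16 code units into code points. An unpaired surrogate is
// replaced by U+FFFD rather than silently dropped.
std::vector<CodePoint> decodeUtf16(const std::vector<std::uint16_t> &u) {
    std::vector<CodePoint> out;
    for (std::size_t i = 0; i < u.size(); ++i) {
        if (isHighSurrogate(u[i])) {
            if (i + 1 < u.size() && isLowSurrogate(u[i + 1])) {
                out.push_back(combineSurrogates(u[i], u[i + 1]));
                ++i;  // the low surrogate has been consumed as well
            } else {
                out.push_back(kReplacementChar);  // high surrogate with no partner
            }
        } else if (isLowSurrogate(u[i])) {
            out.push_back(kReplacementChar);  // stray low surrogate
        } else {
            out.push_back(u[i]);  // ordinary BMP character
        }
    }
    return out;
}
```

The pair D840 DC00 shows why the addition matters: (0xD840 & 0x3ff) << 10
is already 0x10000, so adding the offset gives U+20000, whereas ORing it
in would leave the value at 0x10000.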

>
> JK

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




