[poppler] pdftotext needs support for surrogates outside the BMP plane
Ross Moore
ross at ics.mq.edu.au
Mon Jun 2 14:38:37 PDT 2008
Hi Jonathan,
Great to see you on this list!
On 03/06/2008, at 7:07 AM, Jonathan Kew wrote:
> On 2 Jun 2008, at 9:51 pm, Ross Moore wrote:
>> So yes, throwing an error is recommended; but, IMHO dropping the
>> character and continuing as far as possible would be a friendly thing
>> to do, as part of how the error is presented.
>
> Simply dropping it is a bad thing; replacing it with U+FFFD
> REPLACEMENT CHARACTER is better.
OK, fine by me.
>> Doesn't this translate into the following ?
>>
>>   Unicode uu = ((u[i] & 0x7ff) << 10) + (u[i+1] & 0x3ff) + 0x10000;
>
> The expression (u[i] & 0x3ff) in the original is fine; the only
> problem is that it should ADD the 0x10000, not OR it with the rest
> of the value. (The 0x400 bit can never be set on a high surrogate;
> if it were, it would have been outside the range D800..DBFF.)
Aaah; of course. And the same bit makes it easy to test the validity
of the low surrogate, which must have 0x400 set (range DC00..DFFF).
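
To pull the points above together, here is a minimal sketch of how the
decoding could look. It is not the actual poppler code: the names
decodeUtf16, combineSurrogates, isHighSurrogate and isLowSurrogate are
hypothetical, and the Unicode typedef merely stands in for poppler's own
type. It adds 0x10000 rather than OR-ing it, uses the 0x400 bit to tell
high from low surrogates, and emits U+FFFD for any unpaired surrogate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a 32-bit stand-in for poppler's Unicode type.
using Unicode = std::uint32_t;

// U+FFFD REPLACEMENT CHARACTER, emitted for any unpaired surrogate.
constexpr Unicode kReplacement = 0xFFFD;

// High surrogates are D800..DBFF: top six bits 110110, so 0x400 is clear.
inline bool isHighSurrogate(std::uint16_t u) { return (u & 0xFC00) == 0xD800; }

// Low surrogates are DC00..DFFF: top six bits 110111, so 0x400 is set.
inline bool isLowSurrogate(std::uint16_t u) { return (u & 0xFC00) == 0xDC00; }

// Combine a valid pair. The 0x10000 must be ADDED: the shifted high bits
// can already occupy the 0x10000 position (e.g. D840 -> 0x10000), so an
// OR would lose the carry and give the wrong code point.
inline Unicode combineSurrogates(std::uint16_t hi, std::uint16_t lo) {
    return ((static_cast<Unicode>(hi) & 0x3ff) << 10)
         + (static_cast<Unicode>(lo) & 0x3ff)
         + 0x10000;
}

// Hypothetical helper: decode UTF-16 code units into code points,
// replacing unpaired surrogates instead of dropping them.
std::vector<Unicode> decodeUtf16(const std::vector<std::uint16_t> &u) {
    std::vector<Unicode> out;
    for (std::size_t i = 0; i < u.size(); ++i) {
        if (isHighSurrogate(u[i]) && i + 1 < u.size() && isLowSurrogate(u[i + 1])) {
            out.push_back(combineSurrogates(u[i], u[i + 1]));
            ++i;  // the low surrogate has been consumed as well
        } else if (isHighSurrogate(u[i]) || isLowSurrogate(u[i])) {
            out.push_back(kReplacement);  // unpaired surrogate
        } else {
            out.push_back(u[i]);  // ordinary BMP character
        }
    }
    return out;
}
```

For example, the pair D834 DD1E should come out as U+1D11E, and the pair
D840 DC00 as U+20000, which is exactly the case where OR-ing in the
0x10000 would go wrong.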
>
> JK
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------