[poppler] pdftotext needs support for surrogates outside the BMP plane

Albert Astals Cid aacid at kde.org
Mon Jun 2 14:21:46 PDT 2008


On Monday 02 June 2008, Jonathan Kew wrote:
> On 2 Jun 2008, at 9:51 pm, Ross Moore wrote:
> > Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
> >
> > A different issue arises if an unpaired surrogate is encountered when
> > converting ill-formed UTF-16 data. By representing such an unpaired
> > surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
> > stream would become ill-formed. While it faithfully reflects the
> > nature of the input, Unicode conformance requires that encoding form
> > conversion always results in a valid data stream. Therefore a converter
> > must treat this as an error. [AF]
> >
> >
> >
> > So yes, throwing an error is recommended; but, IMHO dropping the
> > character and continuing as far as possible would be a friendly thing
> > to do, as part of how the error is presented.
>
> Simply dropping it is a bad thing; replacing it with U+FFFD
> REPLACEMENT CHARACTER is better.
>
> > Now here's my concern about that conversion formula:
> >
> >       Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
> >
> > With | being bitwise 'or', doesn't this convert  <d840 dc00> to
> > 0x10000  when the correct result is 0x20000 ?
>
> You're right. Good catch; I should have noticed that too.
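
To make the arithmetic easy to check, here is a small sketch comparing the
formula as posted with a variant that adds 0x10000 instead. The Unicode
typedef and both function names are stand-ins invented for this
illustration; they are not taken from the patch.

```cpp
#include <cstdint>

// Hypothetical stand-in for poppler's Unicode type (assumed 32-bit unsigned).
typedef uint32_t Unicode;

// Formula as posted: 0x10000 is OR'ed into the result.
inline Unicode combineWithOr(uint16_t hi, uint16_t lo) {
    return (Unicode)(hi & 0x3ff) << 10 | (lo & 0x3ff) | 0x10000;
}

// Variant that adds 0x10000 to the combined 20-bit value instead.
inline Unicode combineWithAdd(uint16_t hi, uint16_t lo) {
    return (((Unicode)(hi & 0x3ff) << 10) | (lo & 0x3ff)) + 0x10000;
}
```

For <d840 dc00>, (0xd840 & 0x3ff) << 10 is already 0x10000, so the OR leaves
it at 0x10000 while the addition gives 0x20000. For a Plane 1 pair such as
<d800 dc00> both versions return 0x10000, which is why the problem does not
show up there.
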
>
> > Thus this formula works correctly for Plane 1 characters only, and
> > not for higher planes.
>
> [...]
>
> >     A surrogate pair denotes the code point
> >
> >       10000 + (H - D800 ) × 400 + (L - DC00)
> >     where H and L are the hex values of the high and low surrogates
> > respectively.
> >
> > Doesn't this translate into the following ?
> >
> >       Unicode uu = ((u[i] & 0x7ff) << 10) + (u[i+1] & 0x3ff) + 0x10000;
>
> The expression (u[i] & 0x3ff) in the original is fine; the only
> problem is that it should ADD the 0x10000, not OR it with the rest of
> the value. (The 0x400 bit can never be set on a high surrogate; if it
> were, it would have been outside the range D800..DBFF.)
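
Putting the points above together, a minimal sketch of the conversion could
look like the following. This is not the patch itself: the function name
utf16ToUcs4, the Unicode typedef and the use of std::vector are assumptions
made for illustration only. It adds 0x10000 rather than OR'ing it, and
substitutes U+FFFD for any unpaired surrogate instead of dropping it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for poppler's Unicode type (assumed 32-bit unsigned).
typedef uint32_t Unicode;

// Hypothetical helper: decode UTF-16 code units into code points.
std::vector<Unicode> utf16ToUcs4(const uint16_t *u, size_t len) {
    std::vector<Unicode> out;
    out.reserve(len);
    for (size_t i = 0; i < len; ++i) {
        const uint16_t c = u[i];
        if (c >= 0xd800 && c <= 0xdbff) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < len && u[i + 1] >= 0xdc00 && u[i + 1] <= 0xdfff) {
                // ADD 0x10000 to the combined 20-bit value; do not OR it.
                out.push_back((((Unicode)(c & 0x3ff) << 10) |
                               (u[i + 1] & 0x3ff)) + 0x10000);
                ++i; // the low surrogate has been consumed
            } else {
                out.push_back(0xfffd); // unpaired high surrogate
            }
        } else if (c >= 0xdc00 && c <= 0xdfff) {
            out.push_back(0xfffd); // low surrogate with no preceding high one
        } else {
            out.push_back(c); // ordinary BMP code unit
        }
    }
    return out;
}
```

With the input <0041 d840 dc00 d800 0042 dc00> this yields U+0041, U+20000,
U+FFFD, U+0042, U+FFFD.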

Good that we caught all that. Koji, can you provide an updated patch?

Albert

>
> JK



