[poppler] pdftotext needs support for surrogates outside the BMP plane
Ross Moore
ross at ics.mq.edu.au
Mon Jun 2 13:51:30 PDT 2008
Hi Albert,
On 03/06/2008, at 4:35 AM, Albert Astals Cid wrote:
> A Dilluns 02 Juny 2008, Koji Otani va escriure:
>> Thank you.
>>
>> I could view the text file with Unicode Symbol font.
>>
>>> Albert
>>
>> Could you confirm the patch with this information?
>
> Works here with the Unicode Symbols font too.
Great.
>
> One last thing:
>
> + if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
> + if (i + 1 < uLen) {
> + Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
I have a slight concern with this formula perhaps being too simple.
(see below)
> + i++;
> + curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
> + }
> + } else {
> + curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
> + }
>
> What happens if "if (i + 1 < uLen) {" is false? Do we lose a char?
> Or should that never happen, which would make it an error? If it's an
> error, I think we should have an else branch with something like
> } else {
> error(-1, "Got surrogate pair start char but did not have second char");
> }
The UTF8 FAQ is here:
http://unicode.org/faq/utf_bom.html#UTF8
One item states:
Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired
surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
stream would become ill-formed. While it faithfully reflects the
nature of the input, Unicode conformance requires that encoding form
conversion always results in a valid data stream. Therefore a converter
must treat this as an error. [AF]
So yes, throwing an error is recommended; but, IMHO dropping the
character and continuing as far as possible would be a friendly thing
to do, as part of how the error is presented.
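A rough illustration of the drop-and-continue idea follows. It is a sketch only, not poppler code: the function name decode_utf16_lenient, the fixed-width integer types, and the extra check that the second unit really is a low surrogate are all assumptions, not part of the patch. Unpaired surrogates are skipped and counted, so the caller can report the error once and still keep the rest of the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not poppler API): decode n UTF-16 code units from
 * `in` into code points in `out`, which must have room for n entries.
 * Unpaired surrogates are dropped and counted in *dropped.
 * Returns the number of code points written. */
static size_t decode_utf16_lenient(const uint16_t *in, size_t n,
                                   uint32_t *out, size_t *dropped)
{
    size_t count = 0;
    *dropped = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t u = in[i];
        if (u >= 0xd800 && u < 0xdc00) {            /* high surrogate */
            if (i + 1 < n && in[i + 1] >= 0xdc00 && in[i + 1] <= 0xdfff) {
                out[count++] = 0x10000u
                             + ((uint32_t)(u - 0xd800) << 10)
                             + (uint32_t)(in[i + 1] - 0xdc00);
                i++;                                /* consume the low half */
            } else {
                (*dropped)++;                       /* high with no partner */
            }
        } else if (u >= 0xdc00 && u <= 0xdfff) {
            (*dropped)++;                           /* stray low surrogate */
        } else {
            out[count++] = u;                       /* ordinary BMP unit */
        }
    }
    return count;
}
```

A caller could emit a single error message when *dropped is non-zero, rather than aborting at the first bad unit.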
Now here's my concern about that conversion formula:
Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
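With | being bitwise 'or', doesn't this convert <d840 dc00> to
0x10000 when the correct result is 0x20000? For that pair the shifted
high part, (0xd840 & 0x3ff) << 10, is already 0x10000, so or-ing in
0x10000 changes nothing.
Thus this formula works correctly for Plane 1 characters (and, by the
same reasoning, the other odd-numbered planes), but not for Plane 2
and the other even-numbered planes.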
Or am I just wrong, due to my lack of experience in programming in C ?
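The arithmetic can be checked directly. Below is a minimal sketch that isolates the two ways of combining the offset; the function names and the use of uint16_t/uint32_t in place of poppler's Unicode type are assumptions made for the illustration.

```c
#include <stdint.h>

/* The formula from the patch: every term combined with bitwise OR. */
static uint32_t pair_with_or(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)(hi & 0x3ff) << 10) | (uint32_t)(lo & 0x3ff) | 0x10000u;
}

/* Same masks, but the 0x10000 offset is added instead of OR-ed in. */
static uint32_t pair_with_add(uint16_t hi, uint16_t lo)
{
    return (((uint32_t)(hi & 0x3ff) << 10) | (uint32_t)(lo & 0x3ff)) + 0x10000u;
}
```

For <d800 dc00> both functions return 0x10000. For <d840 dc00> the OR version returns 0x10000 again, while the additive version returns 0x20000.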
Wikipedia
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates
gives this formula:
A surrogate pair denotes the code point
10000 + (H - D800 ) × 400 + (L - DC00)
where H and L are the hex values of the high and low surrogates
respectively.
Doesn't this translate into the following ?
Unicode uu = ((u[i] & 0x7ff) << 10 ) + (u[i+1] & 0x3ff) + 0x10000;
with u[i] and u[i+1] within their valid ranges; or maybe
Unicode uu = (((u[i] & 0x7ff) << 10 ) | (u[i+1] & 0x3ff)) + 0x10000;
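To check that the masked form really matches the Wikipedia formula, here is a small sketch that writes the formula literally with subtraction and compares it against the masked variant over every valid surrogate pair. The function names and the fixed-width types are assumptions for the illustration, not poppler code.

```c
#include <stdint.h>

/* Wikipedia formula written literally: 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00).
 * Assumes hi is in 0xd800..0xdbff and lo is in 0xdc00..0xdfff. */
static uint32_t from_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000u + ((uint32_t)(hi - 0xd800) << 10) + (uint32_t)(lo - 0xdc00);
}

/* Masked variant as proposed in the message (0x7ff on the high unit). */
static uint32_t from_pair_masked(uint16_t hi, uint16_t lo)
{
    return (((uint32_t)(hi & 0x7ff) << 10) | (uint32_t)(lo & 0x3ff)) + 0x10000u;
}

/* Returns 1 if the two variants agree on all 1024 * 1024 valid pairs. */
static int variants_agree(void)
{
    for (uint32_t hi = 0xd800; hi <= 0xdbff; hi++)
        for (uint32_t lo = 0xdc00; lo <= 0xdfff; lo++)
            if (from_pair((uint16_t)hi, (uint16_t)lo) !=
                from_pair_masked((uint16_t)hi, (uint16_t)lo))
                return 0;
    return 1;
}
```

The 0x7ff mask gives the same result as 0x3ff here because bit 10 is always clear in a high surrogate; the two would differ only on input outside the valid range.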
BTW, the Code2002 font will let you test Plane 2 characters.
Doubtless there are other free fonts too.
>
> Albert
>
>> -----------
>> Koji Otani.
Hope this helps,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------