[poppler] pdftotext needs support for surrogates outside the BMP plane

Mon Jun 2 13:51:30 PDT 2008

Hi Albert,

On 03/06/2008, at 4:35 AM, Albert Astals Cid wrote:
> On Monday 02 June 2008, Koji Otani wrote:
>> Thank you.
>>
>> I could view the text file with Unicode Symbol font.
>>
>>> Albert
>>
>>  Could you confirm the patch with this information?
>
> Works here with the Unicode Symbols font too.

Great.

>
> One last thing:
>
> +      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
> +       if (i + 1 < uLen) {
> +         Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;

I have a slight concern with this formula perhaps being too simple.
(see below)

> +         i++;
> +         curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
> +       }
> +      } else {
> +       curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
> +      }
>
> What happens if "if (i + 1 < uLen) {" is false? Do we lose a char? Or
> should that never happen, so that it is an error? If it's an error, I
> think we should have an else branch with something like
> } else {
>     error(-1, "Got surrogate pair start char but did not have second char");
> }

The UTF-8 FAQ is here:

      http://unicode.org/faq/utf_bom.html#UTF8

One item states:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired
surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
stream would become ill-formed. While it faithfully reflects the
nature of the input, Unicode conformance requires that encoding form
conversion always results in a valid data stream. Therefore a converter
must treat this as an error. [AF]

So yes, throwing an error is recommended; but, IMHO dropping the  
character and continuing as far as possible would be a friendly thing  
to do, as part of how the error is presented.
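
As a rough sketch of that "drop and continue" behaviour (not the actual
patch): the function name, the typedef standing in for poppler's Unicode
type, and the use of fprintf in place of poppler's error() are all my
assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef unsigned int Unicode; /* assumption: stand-in for poppler's Unicode type */

/* Hypothetical helper: copies u[0..uLen) into out, combining valid
 * surrogate pairs and dropping unpaired surrogates with a warning.
 * out must have room for uLen entries.
 * Returns the number of code points written. */
static size_t decode_utf16_lenient(const Unicode *u, size_t uLen, Unicode *out)
{
    size_t n = 0;
    for (size_t i = 0; i < uLen; i++) {
        if (u[i] >= 0xd800 && u[i] < 0xdc00) {        /* high surrogate */
            if (i + 1 < uLen && u[i + 1] >= 0xdc00 && u[i + 1] < 0xe000) {
                out[n++] = 0x10000 + ((u[i] - 0xd800) << 10)
                                   + (u[i + 1] - 0xdc00);
                i++;                                  /* consume the low surrogate */
            } else {
                /* report the problem, drop the unit, keep going */
                fprintf(stderr, "unpaired high surrogate U+%04X dropped\n", u[i]);
            }
        } else if (u[i] >= 0xdc00 && u[i] < 0xe000) { /* stray low surrogate */
            fprintf(stderr, "unpaired low surrogate U+%04X dropped\n", u[i]);
        } else {
            out[n++] = u[i];                          /* ordinary BMP character */
        }
    }
    return n;
}
```

The output then never contains a lone surrogate, so a later conversion
to UTF-8 stays well-formed, while the rest of the text is still extracted.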

Now here's my concern about that conversion formula:

      Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;

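With | being bitwise 'or', doesn't this convert  <d840 dc00> to
0x10000  when the correct result is 0x20000 ?  Here
(0xd840 & 0x3ff) << 10 is already 0x10000, so or-ing in 0x10000
adds nothing.

Thus this formula works correctly for Plane 1 characters, but not
for Plane 2, nor for any other pair where bit 16 is already set
before the 'or'.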

Or am I just wrong, due to my lack of experience in programming in C ?
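
To check the arithmetic, here is a minimal sketch in C that evaluates the
patch's expression next to the standard UTF-16 definition (add 0x10000 to
the 20-bit value formed from the two surrogates). The Unicode typedef and
the two function names are my own stand-ins for illustration, not poppler
code.

```c
typedef unsigned int Unicode; /* assumption: stand-in for poppler's Unicode type */

/* The expression from the patch: 0x10000 is or-ed in. */
static Unicode combine_or(Unicode hi, Unicode lo)
{
    return (hi & 0x3ff) << 10 | (lo & 0x3ff) | 0x10000;
}

/* Reference value: 0x10000 is added to the 20-bit offset. */
static Unicode combine_add(Unicode hi, Unicode lo)
{
    return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}
```

For <d840 dc00> the first returns 0x10000 and the second 0x20000; for the
Plane 1 pair <d835 dc00> both return 0x1d400.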

Wikipedia

     http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates

gives this formula:

    A surrogate pair denotes the code point

      10000 + (H - D800) × 400 + (L - DC00)

    where H and L are the hex values of the high and low surrogates
    respectively.

Doesn't this translate into the following ?

      Unicode uu = ((u[i] & 0x7ff) << 10 ) + (u[i+1] & 0x3ff) + 0x10000;

with u[i] and u[i+1] within their valid ranges; or maybe

      Unicode uu = (((u[i] & 0x7ff) << 10 ) | (u[i+1] & 0x3ff)) + 0x10000;
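
A formula of this kind can be checked with a round trip. A minimal sketch,
assuming both inputs have already been verified to lie in the surrogate
ranges; the typedef and the names from_pair and to_pair are hypothetical,
for illustration only:

```c
typedef unsigned int Unicode; /* assumption: stand-in for poppler's Unicode type */

/* Combine a high surrogate (0xd800..0xdbff) and a low surrogate
 * (0xdc00..0xdfff) into a code point; 0x10000 is added, not or-ed. */
static Unicode from_pair(Unicode hi, Unicode lo)
{
    return (((hi & 0x3ff) << 10) | (lo & 0x3ff)) + 0x10000;
}

/* Inverse: split a code point in 0x10000..0x10ffff into its pair. */
static void to_pair(Unicode cp, Unicode *hi, Unicode *lo)
{
    cp -= 0x10000;
    *hi = 0xd800 + (cp >> 10);
    *lo = 0xdc00 + (cp & 0x3ff);
}
```

Feeding the output of to_pair back into from_pair should reproduce the
original code point for every value from 0x10000 to 0x10ffff, which makes
for an easy exhaustive test.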

BTW, the Code2002 font will let you test Plane 2 characters.

Doubtless there are other free fonts too.

>
> Albert
>
>> -----------
>> Koji Otani.

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------