[poppler] pdftotext needs support for surrogates outside the BMP plane

Wed May 28 16:06:24 PDT 2008

On 28/05/2008, at 6:25 PM, Koji Otani wrote:
> Hi.
>
> ross> There are many pieces of software that do not regard the 6-byte
> ross> sequences as being valid UTF-8. Thus there needs to be an extra
> ross> step that translates these 2 x 3 = 6-byte sequences into the
> ross> proper UTF-8 4-byte sequence.
> ross>
> ross> Is anybody working on this kind of thing?
> ross>
>
> I've made a patch that fixes this bug, and attached it to this mail.

Thank you very much for this.
It works brilliantly.
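
For anyone following the thread, the translation discussed above (two
3-byte sequences, one per surrogate, turned into a single 4-byte UTF-8
sequence) can be sketched roughly as below. This is only an illustration
in Python, not the attached patch; the function names are made up for
this example, and it assumes the input is otherwise well-formed UTF-8.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def fix_six_byte_sequences(data: bytes) -> bytes:
    """Rewrite surrogates encoded as 2 x 3 bytes into 4-byte UTF-8.

    Hypothetical helper: 'surrogatepass' lets Python decode the 3-byte
    encodings of individual surrogates instead of rejecting them.
    """
    text = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(text):
        unit = ord(text[i])
        has_next = i + 1 < len(text)
        if (0xD800 <= unit <= 0xDBFF and has_next
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            # High surrogate followed by low surrogate: merge the pair.
            out.append(chr(combine_surrogates(unit, ord(text[i + 1]))))
            i += 2
        else:
            # Ordinary character (or an unpaired surrogate, left as-is).
            out.append(text[i])
            i += 1
    return "".join(out).encode("utf-8", "surrogatepass")


if __name__ == "__main__":
    six_bytes = bytes.fromhex("eda0b5edb0b4")  # surrogates D835, DC34
    print(fix_six_byte_sequences(six_bytes).hex())
```

Text without any surrogates passes through unchanged, so a filter like
this could in principle be run over a whole extracted file.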

The attached image shows the result of using

      pdftotext -layout testmath.pdf

on the example PDF from my previous message,
viewed with a standard Mac text-editor application.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 21.png
Type: image/png
Size: 66655 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080529/6d57b0b5/attachment-0001.png 
-------------- next part --------------

> According to PDF Reference 1.7,
> ToUnicode CMaps define the mapping from character codes to Unicode
> expressed in UTF-16BE.
> So, I think you can't encode the code point U+1D434 directly into a
> ToUnicode CMap.

That's what I suspected.
With surrogates working now, it isn't needed anyway.
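
As a rough illustration of the UTF-16BE form mentioned above, the
surrogate pair for a code point outside the BMP can be computed as
follows. This is a sketch in Python; the helper name is hypothetical,
and the assumption that the resulting hex string is what would appear
as the target value in a ToUnicode CMap entry is mine, not taken from
the patch.

```python
def utf16be_hex(code_point: int) -> str:
    """Return the UTF-16BE encoding of a code point as upper-case hex.

    Hypothetical helper for illustration only.
    """
    return chr(code_point).encode("utf-16-be").hex().upper()


if __name__ == "__main__":
    # U+1D434 lies outside the BMP, so it needs two 16-bit code units.
    print(utf16be_hex(0x1D434))
```

For U+1D434 this gives the two code units D835 and DC34, i.e. a high
surrogate followed by a low surrogate.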

Thanks again for your work.
Much appreciated.

In a following email I'll describe another problem,
regarding extraction of accent characters.

>
> ----------
> Koji Otani

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------