[poppler] pdftotext needs support for surrogates outside the BMP plane
Ross Moore
ross at ics.mq.edu.au
Wed May 28 16:06:24 PDT 2008
On 28/05/2008, at 6:25 PM, Koji Otani wrote:
> Hi.
>
> ross> There are many pieces of software that do not regard the 6-byte
> ross> sequences as being valid UTF-8. Thus there needs to be an extra
> ross> step that translates these 2 x 3 = 6-byte sequences into the
> ross> proper UTF-8 4-byte sequence.
> ross>
> ross> Is anybody working on this kind of thing?
> ross>
>
> I've made a patch that fixes this bug, and attached it to this mail.
Thank you very much for this.
It works brilliantly.
The attached image shows the result of using
pdftotext -layout testmath.pdf
on the example PDF from my previous message,
viewed with a standard Mac text-editor application.
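
For reference, the translation step discussed above (two 3-byte surrogate
encodings joined into one 4-byte UTF-8 sequence) can be sketched roughly as
follows. This is only an illustration in Python, not the patch attached to
Koji's mail; the function name cesu8_to_utf8 is made up for this example, and
it assumes the input contains no unpaired surrogates.

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Re-encode surrogate pairs that were written as two 3-byte sequences.

    Hypothetical helper for illustration; assumes every surrogate in the
    input is part of a well-formed high/low pair.
    """
    # 'surrogatepass' lets the UTF-8 decoder accept the 3-byte encodings of
    # individual surrogates (U+D800..U+DFFF) instead of raising an error.
    text = data.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins each high/low surrogate pair into
    # a single supplementary-plane code point. The strict decode raises
    # UnicodeDecodeError if a surrogate is unpaired.
    text = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    # A normal UTF-8 encode now emits the 4-byte form.
    return text.encode("utf-8")


# U+1D434 written as two 3-byte sequences (D835 and DC34), then repaired.
six_bytes = b"\xed\xa0\xb5\xed\xb0\xb4"
four_bytes = cesu8_to_utf8(six_bytes)
```

Text that contains no surrogates passes through unchanged, so the step can be
applied to the whole output stream.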
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 21.png
Type: image/png
Size: 66655 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080529/6d57b0b5/attachment-0001.png
-------------- next part --------------
> According to PDF Reference 1.7,
> ToUnicode CMaps define the mapping from character codes to Unicode
> expressed in UTF-16BE.
> So, I think you can't encode the code point U+1D434 directly into a
> ToUnicode CMap.
That's what I suspected.
With surrogates working now, it isn't needed anyway.
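
As a worked example of what UTF-16BE means for a character such as U+1D434,
the sketch below computes the surrogate pair that would stand in for it. It is
only an illustration in Python; the helper names are invented here, and the
result is cross-checked against Python's own UTF-16BE encoder.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16be_hex(code_point: int) -> str:
    """Return the UTF-16BE code units of a code point as uppercase hex."""
    return chr(code_point).encode("utf-16-be").hex().upper()


high, low = surrogate_pair(0x1D434)
pair_hex = f"{high:04X}{low:04X}"
```

For U+1D434 the pair is D835 DC34, so the Unicode side of the mapping is the
four bytes D835DC34 rather than the code point value itself.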
Thanks again for your work.
Much appreciated.
In a following email I'll describe another problem,
regarding extraction of accented characters.
>
> ----------
> Koji Otani
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------