[poppler] pdftotext needs support for surrogates outside the BMP plane
Ross Moore
ross at ics.mq.edu.au
Mon May 26 20:11:16 PDT 2008
Hi all,
Searching the archives, I came across this message:
http://lists.freedesktop.org/archives/poppler/2008-February/003401.html
from Michael Vrable, in which several issues were raised.
The thread continues with work regarding Annot.cc.
But there seems to have been no action on the following point:
>> Also missing: support for Unicode text outside the BMP, using
>> surrogate pairs.
This relates to my work, where I'm developing CMap resources for
the older TeX fonts, which are used in many hundreds of thousands
of documents, available at scientific journal sites, and preprint
archives. These often use mathematical characters which are assigned
to Plane 1.
Attached is a PDF that contains many of these, in which the fonts
have /ToUnicode CMap resources, whereby the Plane-1 characters
are associated with surrogate pairs.
When extracting the text from this PDF, tools such as Adobe Reader
and Apple's Preview create the correct UTF-8 multibyte sequences;
viz.
math italic
<F0><9D><90><B4> <F0><9D><90><B5> <F0><9D><90><B6> ...
for Ux1D434 Ux1D435 Ux1D436 etc.
whereas pdftotext simply encodes each 2-byte half of the surrogate
pair separately, as its own 3-byte sequence:
math italic
<ED><A0><B5><ED><B0><B4> <ED><A0><B5><ED><B0><B5>
<ED><A0><B5><ED><B0><B6>
for Ux0D835+0DC34 Ux0D835+0DC35 Ux0D835+0DC36
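
For illustration, the relationship between the two byte sequences above
can be sketched in Python. This is not poppler code; the function names
combine_surrogates and hex_bytes are hypothetical helpers made up for
this example. It uses the standard UTF-16 rule for combining a high and
a low surrogate, and Python's "surrogatepass" error handler to mimic
the separate 3-byte encoding of each surrogate.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def hex_bytes(data: bytes) -> str:
    """Format bytes in the <XX><YY> style used in this message."""
    return "".join(f"<{b:02X}>" for b in data)


# The pair D835 DC34 from the math italic example.
code_point = combine_surrogates(0xD835, 0xDC34)

# Proper UTF-8: one 4-byte sequence for the combined code point.
correct = chr(code_point).encode("utf-8")

# 'surrogatepass' makes Python encode each lone surrogate as 3 bytes,
# which reproduces the 6-byte output described above.
naive = (chr(0xD835) + chr(0xDC34)).encode("utf-8", "surrogatepass")

print(f"{code_point:X}", hex_bytes(correct), hex_bytes(naive))
```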
There are many pieces of software that do not regard the 6-byte
sequences as being valid UTF-8. Thus there needs to be an extra step
that translates these 2 x 3 = 6-byte sequences into the proper UTF-8
4-byte sequence.
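
As a rough sketch of what such an extra step could look like when
applied to already-extracted text, the following Python function
rewrites the 6-byte sequences as 4-byte ones. The name
repair_surrogate_utf8 is hypothetical, and this is a post-processing
workaround under the assumption that the surrogates always arrive in
well-formed high/low pairs; it is not a description of how pdftotext
itself would be fixed.

```python
def repair_surrogate_utf8(data: bytes) -> bytes:
    """Rewrite separately-encoded surrogate pairs as proper 4-byte UTF-8.

    Assumes every surrogate in the input is part of a well-formed
    high/low pair; an unpaired surrogate raises UnicodeDecodeError.
    """
    # Accept the 3-byte encodings of individual surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # Write them out as plain UTF-16 code units...
    utf16 = text.encode("utf-16-le", "surrogatepass")
    # ...so that a strict UTF-16 decode pairs them up into one character.
    return utf16.decode("utf-16-le").encode("utf-8")


broken = bytes.fromhex("EDA0B5EDB0B4") + b" " + bytes.fromhex("EDA0B5EDB0B5")
fixed = repair_surrogate_utf8(broken)
print(fixed.hex().upper())
```

Input that is already valid UTF-8 passes through unchanged, so the
function can be applied to the whole of the extracted text.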
Is anybody working on this kind of thing?
Alternatively, does anybody know how to encode Ux1D434 code-points
directly into a CMap resource, other than via a surrogate pair?
I've tried using begincidchar and begincidrange, but could not
get this to work for text-extraction via Copy/Paste.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testmath.pdf
Type: application/pdf
Size: 83005 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080527/0aa3caf7/attachment-0001.pdf
-------------- next part --------------
I'd be very grateful for any help with this.
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------