[poppler] pdftotext needs support for surrogates outside the BMP plane

Ross Moore ross at ics.mq.edu.au
Mon May 26 20:11:16 PDT 2008


Hi all,

Searching the archives, I came across this message:

http://lists.freedesktop.org/archives/poppler/2008-February/003401.html

from Michael Vrable, in which several issues were raised.
The thread continues with work regarding  Annot.cc .
But there seems to have been no action on the following point:

>> Also missing: support for Unicode text outside the BMP, using
>> surrogate pairs.



This relates to my work, where I'm developing CMap resources for
the older TeX fonts, which are used in many hundreds of thousands
of documents, available at scientific journal sites, and preprint
archives. These often use mathematical characters which are assigned
to Plane 1.

Attached is a PDF that contains many of these, in which the fonts
have  /ToUnicode  CMap  resources, whereby the Plane-1 characters
are associated with surrogate pairs.

When extracting the text from this PDF, tools such as Adobe Reader
and Apple's Preview create the correct UTF-8 multibyte sequences;
viz.

   math italic
    <F0><9D><90><B4> <F0><9D><90><B5> <F0><9D><90><B6> ...
for    Ux1D434       Ux1D435          Ux1D436       etc.

whereas pdftotext simply translates each 16-bit half of the
surrogate pair separately, giving a 3-byte sequence for each half:

   math italic
   <ED><A0><B5><ED><B0><B4> <ED><A0><B5><ED><B0><B5> <ED><A0><B5><ED><B0><B6>
for     Ux0D835+0DC34            Ux0D835+0DC35            Ux0D835+0DC36
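
To make the relationship between the two byte sequences explicit, here
is a small Python sketch of the arithmetic. It is only an illustration,
not code from poppler; the helper name  combine_surrogates  is made up
for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = combine_surrogates(0xD835, 0xDC34)
print(hex(cp))                        # 0x1d434

# Encoding the combined code point gives the 4-byte form.
print(chr(cp).encode("utf-8").hex())  # f09d90b4

# Encoding each half on its own gives the 3 + 3 byte form instead.
halves = "\ud835\udc34".encode("utf-8", "surrogatepass")
print(halves.hex())                   # eda0b5edb0b4
```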


There are many pieces of software that do not regard the 6-byte
sequences as being valid UTF-8. Thus there needs to be an extra step
that translates these 2 x 3 = 6-byte sequences into the proper UTF-8
4-byte sequence.
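
As a rough sketch of what such an extra step could look like when done
as post-processing on the extracted bytes (this is an assumption about
one possible workaround, not a description of anything pdftotext does;
the function name  repair_surrogate_utf8  is hypothetical):

```python
def repair_surrogate_utf8(data: bytes) -> bytes:
    """Re-encode separately encoded surrogate halves as 4-byte UTF-8."""
    # "surrogatepass" lets the decoder accept the 3-byte encodings of
    # U+D800..U+DFFF, which a strict UTF-8 decoder rejects.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each high/low pair into one
    # code point; an unpaired surrogate raises UnicodeDecodeError here.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


broken = bytes.fromhex("eda0b5edb0b4 20 eda0b5edb0b5")
print(repair_surrogate_utf8(broken).hex(" "))
```

Text that contains no surrogate halves passes through unchanged, so
the step could be applied to the whole output.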

Is anybody working on this kind of thing?


Alternatively, does anybody know how to encode  Ux1D434 code-points
directly into a CMap resource, other than via a surrogate pair?
I've tried using  begincidchar  and  begincidrange , but could not
get this to work for text-extraction via Copy/Paste.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: testmath.pdf
Type: application/pdf
Size: 83005 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080527/0aa3caf7/attachment-0001.pdf 
-------------- next part --------------


I'd be very grateful for any help with this.


Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




