[poppler] pdftotext needs support for surrogates outside the BMP plane
Koji Otani
sho at bbr.jp
Wed May 28 01:25:36 PDT 2008
Hi.
From: Ross Moore <ross at ics.mq.edu.au>
Subject: [poppler] pdftotext needs support for surrogates outside the BMP plane
Date: Tue, 27 May 2008 13:11:16 +1000
Message-ID: <2418959F-31FA-4BE8-92D5-5BE292A89CE9 at maths.mq.edu.au>
ross> Hi all,
ross>
ross> Searching the archives, I came across this message:
ross>
ross> http://lists.freedesktop.org/archives/poppler/2008-February/003401.html
ross>
ross> from Michael Vrable, in which several issues were raised.
ross> The thread continues with work regarding Annot.cc .
ross> But there seems to have been no action on the following point:
ross>
ross> >> Also missing: support for Unicode text outside the BMP, using
ross> >> surrogate pairs.
ross>
ross>
ross>
ross> This relates to my work, where I'm developing CMap resources for
ross> the older TeX fonts, which are used in many hundreds of thousands
ross> of documents, available at scientific journal sites, and preprint
ross> archives. These often use mathematical characters which are assigned
ross> to Plane 1.
ross>
ross> Attached is a PDF that contains many of these, in which the fonts
ross> have /ToUnicode CMap resources, whereby the Plane-1 characters
ross> are associated with surrogate pairs.
ross>
ross> When extracting the text from this PDF, tools such as Adobe reader
ross> and Apple's preview create the correct UTF-8 multibyte sequences;
ross> viz.
ross>
ross> math italic
ross> <F0><9D><90><B4> <F0><9D><90><B5> <F0><9D><90><B6> ...
ross> for Ux1D434 Ux1D435 Ux1D436 etc.
ross>
ross> whereas pdftotext simply translates each 16-bit half of the
ross> surrogate pair separately:
ross>
ross> math italic
ross> <ED><A0><B5><ED><B0><B4> <ED><A0><B5><ED><B0><B5>
ross> <ED><A0><B5><ED><B0><B6>
ross> for Ux0D835+0DC34 Ux0D835+0DC35 Ux0D835+0DC36
ross>
ross>
ross> There are many pieces of software that do not regard the 6-byte
ross> sequences as being valid UTF-8. Thus there needs to be an extra
ross> step that translates these 2 x 3 = 6-byte sequences into the
ross> proper UTF-8 4-byte sequence.
ross>
ross> Is anybody working on this kind of thing?
ross>
I've made a patch that fixes this bug and attached it to this mail.
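
To make the 6-byte vs. 4-byte difference concrete, here is a minimal
standalone sketch of the conversion described above. It is not poppler
code: the function names combineSurrogates, toUtf8 and utf16ToUtf8 are
made up for this illustration, and replacing an unpaired surrogate with
U+FFFD is my assumption, not something the patch does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Combine a UTF-16 high/low surrogate pair into one code point.
// Hypothetical helper, not part of poppler.
uint32_t combineSurrogates(uint16_t hi, uint16_t lo) {
  return 0x10000u + ((static_cast<uint32_t>(hi) & 0x3ffu) << 10)
                  + (static_cast<uint32_t>(lo) & 0x3ffu);
}

// Encode one code point (up to U+10FFFF) as UTF-8.
std::string toUtf8(uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xc0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3f));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xe0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
    out += static_cast<char>(0x80 | (cp & 0x3f));
  } else {
    out += static_cast<char>(0xf0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
    out += static_cast<char>(0x80 | (cp & 0x3f));
  }
  return out;
}

// Convert UTF-16 code units to UTF-8, joining valid surrogate pairs.
std::string utf16ToUtf8(const std::vector<uint16_t> &in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    uint32_t cp = in[i];
    bool isHigh = cp >= 0xd800 && cp < 0xdc00;
    if (isHigh && i + 1 < in.size() &&
        in[i + 1] >= 0xdc00 && in[i + 1] < 0xe000) {
      cp = combineSurrogates(in[i], in[i + 1]);
      ++i;  // the low surrogate has been consumed
    } else if (cp >= 0xd800 && cp < 0xe000) {
      cp = 0xfffd;  // assumption: unpaired surrogate becomes U+FFFD
    }
    out += toUtf8(cp);
  }
  return out;
}
```

With the pair 0xD835 0xDC34 from Ross's example this gives the four
bytes F0 9D 90 B4 instead of the six bytes ED A0 B5 ED B0 B4.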
ross>
ross> Alternatively, does anybody know how to encode Ux1D434 code-points
ross> directly into a CMap resource, other than via a surrogate pair?
ross> I've tried using begincidchar and begincidrange , but could not
ross> get this to work for text-extraction via Copy/Paste.
ross>
According to PDF Reference 1.7, a ToUnicode CMap defines the mapping
from character codes to Unicode values expressed in UTF-16BE.
So I think you can't encode Ux1D434 directly in a ToUnicode CMap
other than as a surrogate pair.
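
As a rough illustration of what UTF-16BE means for such an entry, the
sketch below formats a code point as the hex string that would appear
as the destination value in a ToUnicode CMap. The function name
toUnicodeHex is hypothetical and only serves this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format a code point as a UTF-16BE hex string such as <D835DC34>.
// Hypothetical helper for illustration; assumes cp is a valid
// Unicode scalar value (not a surrogate, not above 0x10FFFF).
std::string toUnicodeHex(uint32_t cp) {
  char buf[16];
  if (cp < 0x10000) {
    // BMP character: a single 16-bit unit
    std::snprintf(buf, sizeof buf, "<%04X>", static_cast<unsigned>(cp));
  } else {
    // Outside the BMP: split into high and low surrogates
    uint32_t v = cp - 0x10000;
    unsigned hi = 0xd800 + (v >> 10);
    unsigned lo = 0xdc00 + (v & 0x3ff);
    std::snprintf(buf, sizeof buf, "<%04X%04X>", hi, lo);
  }
  return buf;
}
```

For Ux1D434 this yields <D835DC34>, so a bfchar line for a font whose
character code is, say, 41 (a made-up code) would read
"<41> <D835DC34>".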
----------
Koji Otani
-------------- next part --------------
diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index 75a0ac0..70c188d 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -2075,7 +2075,15 @@ void TextPage::addChar(GfxState *state, double x, double y,
w1 /= uLen;
h1 /= uLen;
for (i = 0; i < uLen; ++i) {
- curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+ if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
+ if (i + 1 < uLen) {
+ Unicode uu = (((u[i] & 0x3ff) << 10) | (u[i+1] & 0x3ff)) + 0x10000;
+ i++;
+ curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+ }
+ } else {
+ curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+ }
}
}
if (curWord) {
diff --git a/poppler/UTF8.h b/poppler/UTF8.h
index 8536dbf..11fb864 100644
--- a/poppler/UTF8.h
+++ b/poppler/UTF8.h
@@ -50,6 +50,20 @@ static int mapUCS2(Unicode u, char *buf, int bufSize) {
buf[0] = (char)((u >> 8) & 0xff);
buf[1] = (char)(u & 0xff);
return 2;
+ } else if (u < 0x110000) {
+ Unicode uu;
+
+ /* using surrogate pair */
+ if (bufSize < 4) {
+ return 0;
+ }
+ uu = ((u - 0x10000) >> 10) + 0xd800;
+ buf[0] = (char)((uu >> 8) & 0xff);
+ buf[1] = (char)(uu & 0xff);
+ uu = (u & 0x3ff)+0xdc00;
+ buf[2] = (char)((uu >> 8) & 0xff);
+ buf[3] = (char)(uu & 0xff);
+ return 4;
} else {
return 0;
}