[poppler] pdftotext needs support for surrogates outside the BMP plane

Wed May 28 01:25:36 PDT 2008

Hi.

From: Ross Moore <ross at ics.mq.edu.au>
Subject: [poppler] pdftotext needs support for surrogates outside the BMP plane
Date: Tue, 27 May 2008 13:11:16 +1000
Message-ID: <2418959F-31FA-4BE8-92D5-5BE292A89CE9 at maths.mq.edu.au>

ross> Hi all,
ross> 
ross> Searching the archives, I came across this message:
ross> 
ross> http://lists.freedesktop.org/archives/poppler/2008-February/003401.html
ross> 
ross> from Michael Vrable, in which several issues were raised.
ross> The thread continues with work regarding  Annot.cc .
ross> But there seems to have been no action on the following point:
ross> 
ross> >> Also missing: support for Unicode text outside the BMP, using
ross> >> surrogate pairs.
ross> 
ross> 
ross> 
ross> This relates to my work, where I'm developing CMap resources for
ross> the older TeX fonts, which are used in many hundreds of thousands
ross> of documents, available at scientific journal sites, and preprint
ross> archives. These often use mathematical characters which are assigned
ross> to Plane 1.
ross> 
ross> Attached is a PDF that contains many of these, in which the fonts
ross> have  /ToUnicode  CMap  resources, whereby the Plane-1 characters
ross> are associated with surrogate pairs.
ross> 
ross> When extracting the text from this PDF, tools such as Adobe Reader
ross> and Apple's Preview create the correct UTF-8 multibyte sequences;
ross> viz.
ross> 
ross>    math italic
ross>     <F0><9D><90><B4> <F0><9D><90><B5> <F0><9D><90><B6> ...
ross> for    Ux1D434       Ux1D435          Ux1D436       etc.
ross> 
ross> whereas pdftotext simply translates each 2-byte half of the
ross> surrogate pair separately, giving two 3-byte sequences:
ross> 
ross>    math italic
ross>    <ED><A0><B5><ED><B0><B4> <ED><A0><B5><ED><B0><B5> <ED><A0><B5><ED><B0><B6>
ross> for    Ux0D835+0DC34         Ux0D835+0DC35            Ux0D835+0DC36
ross> 
ross> 
ross> There are many pieces of software that do not regard the 6-byte
ross> sequences as being valid UTF-8. Thus there needs to be an extra
ross> step that translates these 2 x 3 = 6-byte sequences into the
ross> proper UTF-8 4-byte sequence.
ross> 
ross> Is anybody working on this kind of thing?
ross> 

I've made a patch that fixes this bug, and attached it to this mail.
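
To illustrate the conversion outside of poppler, here is a minimal
standalone sketch. It is not the patch itself: the function names
(encodeUtf8, combineSurrogates, utf16ToUtf8) are made up for this
example, and input code points are assumed to be at most 0x10FFFF.
It pairs a high surrogate with the following low surrogate before
encoding, so that D835 DC34 becomes the 4-byte sequence
F0 9D 90 B4 instead of ED A0 B5 ED B0 B4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point as UTF-8 (1 to 4 bytes).
// Surrogate values are encoded as given, which is how the invalid
// 3-byte forms such as ED A0 B5 come about.
std::string encodeUtf8(uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Combine a high and a low surrogate into one code point.
// The 0x10000 offset is added (not OR-ed) so that planes 2 and up
// also come out right.
uint32_t combineSurrogates(uint16_t hi, uint16_t lo) {
  assert(hi >= 0xD800 && hi < 0xDC00);
  assert(lo >= 0xDC00 && lo < 0xE000);
  return (((hi & 0x3FFu) << 10) | (lo & 0x3FFu)) + 0x10000u;
}

// Convert UTF-16 code units to UTF-8, pairing surrogates where a
// high surrogate is directly followed by a low one. Unpaired
// surrogates are passed through unchanged in this sketch.
std::string utf16ToUtf8(const std::vector<uint16_t> &units) {
  std::string out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    uint32_t cp = units[i];
    bool isHigh = cp >= 0xD800 && cp < 0xDC00;
    if (isHigh && i + 1 < units.size() &&
        units[i + 1] >= 0xDC00 && units[i + 1] < 0xE000) {
      cp = combineSurrogates(units[i], units[i + 1]);
      ++i;  // the low surrogate has been consumed
    }
    out += encodeUtf8(cp);
  }
  return out;
}
```

Calling utf16ToUtf8({0xD835, 0xDC34}) gives the bytes F0 9D 90 B4,
while encoding the two halves separately with encodeUtf8 gives the
6-byte form shown above.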


ross> 
ross> Alternatively, does anybody know how to encode  Ux1D434 code-points
ross> directly into a CMap resource, other than via a surrogate pair?
ross> I've tried using  begincidchar  and  begincidrange , but could not
ross> get this to work for text-extraction via Copy/Paste.
ross> 

According to PDF Reference 1.7,
ToUnicode CMaps define the mapping from character codes to Unicode
expressed in UTF-16BE.
So I think you can't encode Ux1D434 directly in a ToUnicode CMap;
it has to be written as a surrogate pair.
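
As a small sketch of what that means in practice, the following
helper computes the UTF-16BE hex digits for a code point, i.e. what
would go between the angle brackets of the destination string in a
ToUnicode CMap entry. The function name toUtf16beHex is made up for
this example, and the input is assumed to be a valid code point up
to 0x10FFFF.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Return the UTF-16BE encoding of a code point as uppercase hex
// digits: 4 digits for the BMP, 8 digits (a surrogate pair) above it.
std::string toUtf16beHex(uint32_t cp) {
  char buf[9];  // up to 8 hex digits plus the terminating NUL
  if (cp < 0x10000) {
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(cp));
  } else {
    uint32_t v = cp - 0x10000;          // 20-bit offset into planes 1..16
    unsigned hi = 0xD800 + (v >> 10);   // high surrogate: top 10 bits
    unsigned lo = 0xDC00 + (v & 0x3FF); // low surrogate: bottom 10 bits
    std::snprintf(buf, sizeof buf, "%04X%04X", hi, lo);
  }
  return buf;
}
```

For Ux1D434 this yields D835DC34, which is the surrogate pair already
used in Ross's CMap resources.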

----------
Koji Otani

-------------- next part --------------

diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index 75a0ac0..70c188d 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -2075,7 +2075,15 @@ void TextPage::addChar(GfxState *state, double x, double y,
     w1 /= uLen;
     h1 /= uLen;
   for (i = 0; i < uLen; ++i) {
-      curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
+	if (i + 1 < uLen) {
+	  Unicode uu = (((u[i] & 0x3ff) << 10) | (u[i+1] & 0x3ff)) + 0x10000;
+	  i++;
+	  curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+	}
+      } else {
+	curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      }
   }
   }
   if (curWord) {
diff --git a/poppler/UTF8.h b/poppler/UTF8.h
index 8536dbf..11fb864 100644
--- a/poppler/UTF8.h
+++ b/poppler/UTF8.h
@@ -50,6 +50,20 @@ static int mapUCS2(Unicode u, char *buf, int bufSize) {
     buf[0] = (char)((u >> 8) & 0xff);
     buf[1] = (char)(u & 0xff);
     return 2;
+  } else if (u < 0x110000) {
+    Unicode uu;
+
+    /* using surrogate pair */
+    if (bufSize < 4) {
+      return 0;
+    }
+    uu = ((u - 0x10000) >> 10) + 0xd800;
+    buf[0] = (char)((uu >> 8) & 0xff);
+    buf[1] = (char)(uu & 0xff);
+    uu = (u & 0x3ff)+0xdc00;
+    buf[2] = (char)((uu >> 8) & 0xff);
+    buf[3] = (char)(uu & 0xff);
+    return 4;
   } else {
     return 0;
   }