[poppler] pdftotext needs support for surrogates outside the BMP plane

Mon Jun 2 20:59:30 PDT 2008

Hi All,

Thank you for checking my patch.
I attach a new patch.

--------------
Koji Otani.


From: Albert Astals Cid <aacid at kde.org>
Subject: Re: [poppler] pdftotext needs support for surrogates outside the BMP plane
Date: Mon, 2 Jun 2008 23:21:46 +0200
Message-ID: <200806022321.48691.aacid at kde.org>

aacid> On Monday 02 June 2008, Jonathan Kew wrote:
aacid> > On 2 Jun 2008, at 9:51 pm, Ross Moore wrote:
aacid> > > Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
aacid> > >
aacid> > > A different issue arises if an unpaired surrogate is encountered when
aacid> > > converting ill-formed UTF-16 data. By representing such an unpaired
aacid> > > surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
aacid> > > stream would become ill-formed. While it faithfully reflects the
aacid> > > nature of the input, Unicode conformance requires that encoding form
aacid> > > conversion always results in valid data stream. Therefore a converter
aacid> > > must treat this as an error. [AF]
aacid> > >
aacid> > >
aacid> > >
aacid> > > So yes, throwing an error is recommended; but, IMHO dropping the
aacid> > > character and continuing as far as possible would be a friendly thing
aacid> > > to do, as part of how the error is presented.
aacid> >
aacid> > Simply dropping it is a bad thing; replacing it with U+FFFD
aacid> > REPLACEMENT CHARACTER is better.
aacid> >
aacid> > > Now here's my concern about that conversion formula:
aacid> > >
aacid> > >       Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
aacid> > >
aacid> > > With | being bitwise 'or', doesn't this convert  <d840 dc00> to
aacid> > > 0x10000  when the correct result is 0x20000 ?
aacid> >
aacid> > You're right. Good catch; I should have noticed that too.
aacid> >
aacid> > > Thus this formula works correctly for Plane 1 characters only, and
aacid> > > not for higher planes.
aacid> >
aacid> > [...]
aacid> >
aacid> > >     A surrogate pair denotes the code point
aacid> > >
aacid> > >       10000 + (H - D800 ) × 400 + (L - DC00)
aacid> > >     where H and L are the hex values of the high and low
aacid> > >     surrogates respectively.
aacid> > >
aacid> > > Doesn't this translate into the following ?
aacid> > >
aacid> > >       Unicode uu = ((u[i] & 0x7ff) << 10 ) + (u[i+1] & 0x3ff) + 0x10000;
aacid> >
aacid> > The expression (u[i] & 0x3ff) in the original is fine; the only
aacid> > problem is that it should ADD the 0x10000, not OR it with the rest of
aacid> > the value. (The 0x400 bit can never be set on a high surrogate; if it
aacid> > were, it would have been outside the range D800..DBFF.)
aacid> 
aacid> Good that we caught all that. Koji, can you provide an updated patch?
aacid> 
aacid> Albert
aacid> 
aacid> >
aacid> > JK
aacid> 
aacid> 
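
The decoding rule discussed in the quoted thread can be sketched on its own, outside of poppler. This is only an illustration under assumptions: the function name decodeUtf16 and the use of std::vector are hypothetical and do not appear in the patch. It combines a valid high/low pair by adding 0x10000 and substitutes U+FFFD for any unpaired surrogate, as suggested in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of poppler): decode UTF-16 code units into
// Unicode code points, replacing unpaired surrogates with U+FFFD.
std::vector<std::uint32_t> decodeUtf16(const std::vector<std::uint16_t> &in) {
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    const std::uint32_t hi = in[i];
    if (hi >= 0xd800 && hi < 0xdc00) {  // high surrogate
      if (i + 1 < in.size() && in[i + 1] >= 0xdc00 && in[i + 1] < 0xe000) {
        const std::uint32_t lo = in[i + 1];
        // 0x10000 must be ADDED. OR-ing it in loses the carry whenever
        // bit 16 of the 20-bit offset is already set (e.g. plane 2).
        out.push_back((((hi & 0x3ff) << 10) | (lo & 0x3ff)) + 0x10000);
        ++i;  // consume the low surrogate as well
      } else {
        out.push_back(0xfffd);  // high surrogate without a low surrogate
      }
    } else if (hi >= 0xdc00 && hi < 0xe000) {
      out.push_back(0xfffd);  // low surrogate without a preceding high one
    } else {
      out.push_back(hi);  // ordinary BMP code unit
    }
  }
  return out;
}
```

For the pair <d840 dc00> from the thread, this sketch yields 0x20000, whereas the OR variant of the formula would yield 0x10000.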
-------------- next part --------------

diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index 75a0ac0..97f4f3f 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -2075,7 +2075,24 @@ void TextPage::addChar(GfxState *state, double x, double y,
     w1 /= uLen;
     h1 /= uLen;
   for (i = 0; i < uLen; ++i) {
-      curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
+	if (i + 1 < uLen && u[i+1] >= 0xdc00 && u[i+1] < 0xe000) {
+	  /* next code is a low surrogate */
+	  Unicode uu = (((u[i] & 0x3ff) << 10) | (u[i+1] & 0x3ff)) + 0x10000;
+	  i++;
+	  curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+	} else {
+	    /* missing low surrogate
+	     replace it with REPLACEMENT CHARACTER (U+FFFD) */
+	  curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+	}
+      } else if (u[i] >= 0xdc00 && u[i] < 0xe000) {
+	  /* invalid low surrogate
+	   replace it with REPLACEMENT CHARACTER (U+FFFD) */
+	curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, 0xfffd);
+      } else {
+	curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      }
   }
   }
   if (curWord) {
diff --git a/poppler/UTF8.h b/poppler/UTF8.h
index 8536dbf..11fb864 100644
--- a/poppler/UTF8.h
+++ b/poppler/UTF8.h
@@ -50,6 +50,20 @@ static int mapUCS2(Unicode u, char *buf, int bufSize) {
     buf[0] = (char)((u >> 8) & 0xff);
     buf[1] = (char)(u & 0xff);
     return 2;
+  } else if (u < 0x110000) {
+    Unicode uu;
+
+    /* using surrogate pair */
+    if (bufSize < 4) {
+      return 0;
+    }
+    uu = ((u - 0x10000) >> 10) + 0xd800;
+    buf[0] = (char)((uu >> 8) & 0xff);
+    buf[1] = (char)(uu & 0xff);
+    uu = (u & 0x3ff)+0xdc00;
+    buf[2] = (char)((uu >> 8) & 0xff);
+    buf[3] = (char)(uu & 0xff);
+    return 4;
   } else {
     return 0;
   }
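
The reverse direction, which the UTF8.h hunk above adds to mapUCS2, can also be sketched in isolation. The name encodeUtf16BE and the unsigned char buffer are hypothetical choices for this illustration. Rejecting code points in the surrogate range is an extra assumption of the sketch, not something the patch does.

```cpp
#include <cstdint>

// Hypothetical stand-alone helper (not poppler's mapUCS2): write code point
// u as big-endian UTF-16 into buf. Returns the number of bytes written, or
// 0 if u cannot be encoded or buf is too small.
int encodeUtf16BE(std::uint32_t u, unsigned char *buf, int bufSize) {
  if (u >= 0xd800 && u < 0xe000) {
    return 0;  // assumption of this sketch: surrogate code points are rejected
  }
  if (u < 0x10000) {  // BMP: a single 16-bit code unit
    if (bufSize < 2) {
      return 0;
    }
    buf[0] = static_cast<unsigned char>((u >> 8) & 0xff);
    buf[1] = static_cast<unsigned char>(u & 0xff);
    return 2;
  }
  if (u < 0x110000) {  // supplementary planes: a surrogate pair
    if (bufSize < 4) {
      return 0;
    }
    const std::uint32_t v = u - 0x10000;            // 20-bit offset
    const std::uint32_t hi = 0xd800 + (v >> 10);    // top 10 bits
    const std::uint32_t lo = 0xdc00 + (v & 0x3ff);  // bottom 10 bits
    buf[0] = static_cast<unsigned char>((hi >> 8) & 0xff);
    buf[1] = static_cast<unsigned char>(hi & 0xff);
    buf[2] = static_cast<unsigned char>((lo >> 8) & 0xff);
    buf[3] = static_cast<unsigned char>(lo & 0xff);
    return 4;
  }
  return 0;  // beyond U+10FFFF
}
```

Under these assumptions, encoding U+20000 writes the four bytes d8 40 dc 00, which is the pair <d840 dc00> discussed in the thread.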