[poppler] pdftotext needs support for surrogates outside the BMP plane

Albert Astals Cid aacid at kde.org
Mon Jun 2 11:35:18 PDT 2008


On Monday 02 June 2008, Koji Otani wrote:
> Thank you.
>
> I could view the text file with the Unicode Symbols font.
>
> > Albert
>
>  Could you confirm the patch with this information?

Works here with the Unicode Symbols font too.

One last thing:

+      if (u[i] >= 0xd800 && u[i] < 0xdc00) { /* surrogate pair */
+       if (i + 1 < uLen) {
+         Unicode uu = (u[i] & 0x3ff) << 10 | (u[i+1] & 0x3ff) | 0x10000;
+         i++;
+         curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, uu);
+       }
+      } else {
+       curWord->addChar(state, x1 + i*w1, y1 + i*h1, w1, h1, c, u[i]);
+      }

What happens if "if (i + 1 < uLen) {" is false? Do we lose a char? Or should
that never happen, so it is an error? If it's an error I think we should have
an else branch with something like
} else {
    error(-1, "Got surrogate pair start char but did not have second char");
}
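
To make the question concrete, here is a minimal standalone sketch of the
decoding with the missing branch counted as an error instead of being dropped
silently. It is not poppler code: DecodeResult, decodeUtf16 and the two
predicates are hypothetical names, std::uint32_t stands in for poppler's
Unicode type, and the error counter marks where the error() call would go.
The sketch also adds 0x10000 rather than OR-ing it in; the two agree for
Plane 1 but differ once the high surrogate reaches 0xd840.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: decoded code points plus a count of malformed
// surrogates (each one is a place where error(-1, ...) would be reported).
struct DecodeResult {
    std::vector<std::uint32_t> codePoints;
    int errors = 0;
};

inline bool isHighSurrogate(std::uint32_t c) { return c >= 0xd800 && c < 0xdc00; }
inline bool isLowSurrogate(std::uint32_t c) { return c >= 0xdc00 && c < 0xe000; }

// Decode UTF-16 code units (stored one per 32-bit element) into code points.
DecodeResult decodeUtf16(const std::uint32_t *u, std::size_t uLen) {
    assert(u != nullptr || uLen == 0);
    DecodeResult r;
    for (std::size_t i = 0; i < uLen; ++i) {
        if (isHighSurrogate(u[i])) {
            if (i + 1 < uLen && isLowSurrogate(u[i + 1])) {
                // Combine the two 10-bit halves, then add the Plane 1 offset.
                r.codePoints.push_back(
                    (((u[i] & 0x3ff) << 10) | (u[i + 1] & 0x3ff)) + 0x10000);
                ++i;  // skip the low surrogate we just consumed
            } else {
                ++r.errors;  // high surrogate with no valid second half
            }
        } else if (isLowSurrogate(u[i])) {
            ++r.errors;  // low surrogate without a preceding high surrogate
        } else {
            r.codePoints.push_back(u[i]);  // ordinary BMP character
        }
    }
    return r;
}
```

With this shape, a string that ends in a lone high surrogate keeps every
earlier character and reports exactly one error, so nothing is lost without
a message.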

Albert

> -----------
> Koji Otani.
>
>
> From: Ross Moore <ross at ics.mq.edu.au>
> Subject: Re: [poppler] pdftotext needs support for surrogates outside the
> BMP plane
> Date: Mon, 2 Jun 2008 15:53:54 +1000
> Message-ID: <308EB069-DD16-407F-B467-5B5F524F9887 at maths.mq.edu.au>
>
> ross> Hi Koji,
> ross>
> ross> On 02/06/2008, at 1:50 PM, Koji Otani wrote:
> ross> >
> ross> >
> ross> > From: Albert Astals Cid <aacid at kde.org>
> ross> > Subject: Re: [poppler] pdftotext needs support for surrogates
> ross> > outside the BMP plane
> ross> > Date: Sun, 1 Jun 2008 17:28:11 +0200
> ross> > Message-ID: <200806011728.11948.aacid at kde.org>
> ross> >
> ross> > aacid> On Thursday 29 May 2008, Koji Otani wrote:
> ross> > aacid> > Hi, All.
> ross> > aacid> >
> ross> > aacid> > I'd like to commit this patch to the trunk tree.
> ross> > aacid> > Should I register this to Bugzilla before doing it?
> ross> > aacid>
> ross> > aacid> No, but I'd like to confirm that "it works" before committing
> ross> > aacid> it. I can see that your patch gives a different output, but I
> ross> > aacid> don't have any font installed on my system that can "draw" the
> ross> > aacid> characters. What font are you using?
> ross> > aacid>
> ross> > aacid> Albert
> ross> > aacid>
> ross> >
> ross> > Output is a UTF-8 text file. I don't have fonts that can draw this
> ross> > text file either. I checked that it is correct with a hexdump
> ross> > application.
> ross> >
> ross> > This problem was reported by Dr. Ross Moore. He viewed it with a Mac
> ross> > text editor, but I can't view it with my Mac text editor.
> ross> >
> ross> >> Dr. Ross Moore
> ross> >  What font are you using?
> ross>
> ross> I have several which can show these glyphs.
> ross>
> ross> In TextEdit, the default font that is being used is "Unicode
> ross> Symbols", as shown in one of the attached screenshots.
> ross> Get it from      http://users.teilar.gr/~g1951d/ .
> ross>
> ross> The other screenshot shows which fonts I have installed
> ross> that support Plane 1 characters.
> ross>
> ross>
> ross> Other possibilities are  Code2000/Code2001/Code2002
> ross> e.g., from  http://www.code2000.net/code2001.htm .
> ross>
> ross> The STIX fonts are scheduled for release soon:
> ross>      http://www.stixfonts.org/rel_sched.html
> ross> (The beta testing release is no longer available.)
> ross>
> ross> Other free fonts are also available; e.g. Asana Math
> ross>    http://openfontlibrary.org/media/files/asyropoulos/219 .
> ross>
> ross> Or if you are prepared to try Microsoft's  "Cambria Math",
> ross> then that should work.
> ross>