[Poppler-bugs] [Bug 37900] pdftotext -htmlmeta and pdftohtml fail to decode U+2019

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Jun 6 17:05:31 PDT 2011


https://bugs.freedesktop.org/show_bug.cgi?id=37900

--- Comment #4 from Steven Murdoch <sjm217-freedesktop at srcf.ucam.org> 2011-06-06 17:05:31 PDT ---
Created an attachment (id=47627)
 View: https://bugs.freedesktop.org/attachment.cgi?id=47627
 Review: https://bugs.freedesktop.org/review?bug=37900&attachment=47627

Fix encoding of PDF document metadata in output of pdftohtml

pdftohtml copies the PDF document title verbatim into the HTML <title>
tag. This fails when the title is UCS-2 encoded, or when it contains
characters that are in pdfDocEncoding (an ISO 8859-1 superset) but not
in ISO 8859-1.  This patch fixes the problem by decoding UCS-2 or
pdfDocEncoding into Unicode, then encoding the result in the desired
output encoding.  HTML escaping was not being done either, so I have
used the existing function HtmlFont::HtmlFilter to perform both HTML
escaping and character set encoding.  This static method had to be made
public so that it can be called from pdftohtml. See bug #37900.
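
The decode, escape, and re-encode steps described above can be illustrated with a short Python sketch. This is not the code from the attached patch and does not reproduce HtmlFont::HtmlFilter. The function names and the override table are hypothetical, the table covers only a handful of the pdfDocEncoding code points that differ from ISO 8859-1, and falling back to the Latin-1 value for every other byte is a simplifying assumption.

```python
import html

# Deliberately partial pdfDocEncoding table (illustrative assumption):
# only a few byte values whose meaning differs from ISO 8859-1.
PDFDOC_OVERRIDES = {
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0x8D: "\u201C",  # left double quotation mark
    0x8E: "\u201D",  # right double quotation mark
    0x8F: "\u2018",  # left single quotation mark
    0x90: "\u2019",  # right single quotation mark (the character in this bug)
}


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string into a Python (Unicode) str."""
    # A leading FE FF byte order mark signals big-endian 16-bit text.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Otherwise treat it as pdfDocEncoding; bytes not in the partial
    # table fall back to their Latin-1 value (a simplification).
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title_bytes(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Decode, HTML-escape, then encode in the requested output charset."""
    text = decode_pdf_text_string(raw)
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    # Characters the output charset cannot represent become numeric
    # character references instead of raising an error.
    return escaped.encode(out_encoding, errors="xmlcharrefreplace")
```

For example, html_title_bytes(b"Bob\x90s PDF", "latin-1") yields a numeric character reference for U+2019 rather than copying the raw 0x90 byte into the <title> element, which is the failure mode the patch description mentions.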
