[Poppler-bugs] [Bug 37900] pdftotext -htmlmeta and pdftohtml fail to decode U+2019
bugzilla-daemon at freedesktop.org
Mon Jun 6 17:05:31 PDT 2011
https://bugs.freedesktop.org/show_bug.cgi?id=37900
--- Comment #4 from Steven Murdoch <sjm217-freedesktop at srcf.ucam.org> 2011-06-06 17:05:31 PDT ---
Created an attachment (id=47627)
View: https://bugs.freedesktop.org/attachment.cgi?id=47627
Review: https://bugs.freedesktop.org/review?bug=37900&attachment=47627
Fix encoding of PDF document metadata in output of pdftohtml
pdftohtml copies the PDF document title verbatim into the HTML <title>
tag. This fails when the title is UCS-2 encoded, or when it contains
characters that are in pdfDocEncoding (an ISO 8859-1 superset) but not
in ISO 8859-1. This patch fixes the problem by decoding UCS-2 or
pdfDocEncoding into Unicode, then encoding the result in the desired
output encoding. HTML escaping wasn't being done either, so I have used
the existing function HtmlFont::HtmlFilter to perform both HTML
escaping and character set encoding. This static method had to be made
public so that it can be called from pdftohtml. See bug #37900.
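
For readers unfamiliar with the two encodings, the decode, escape and
re-encode steps described above can be sketched roughly as follows. This
is an illustration in Python, not the code in the patch: the function
names and the override table are hypothetical, the table is only a
partial excerpt of pdfDocEncoding, and bytes not listed in it are
assumed to map as in ISO 8859-1.

```python
import html

# Partial excerpt (assumption: not the full table) of pdfDocEncoding
# bytes whose meaning differs from ISO 8859-1.
PDFDOC_OVERRIDES = {
    0x80: "\u2022",  # bullet
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0x8D: "\u201C",  # left double quotation mark
    0x8E: "\u201D",  # right double quotation mark
    0x8F: "\u2018",  # left single quotation mark
    0x90: "\u2019",  # right single quotation mark (the U+2019 in this bug)
    0x92: "\u2122",  # trade mark sign
    0xA0: "\u20AC",  # euro sign
}


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string into Unicode (hypothetical helper)."""
    # A leading FE FF byte order mark signals 16-bit big-endian Unicode.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Otherwise treat the bytes as pdfDocEncoding; bytes missing from the
    # excerpt fall back to their ISO 8859-1 value (a simplification).
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)


def html_title(raw: bytes, out_encoding: str = "utf-8") -> bytes:
    """Decode, HTML-escape, then encode a title for the output page."""
    text = decode_pdf_text_string(raw)
    escaped = html.escape(text, quote=True)
    # Characters the output encoding cannot represent become numeric
    # character references instead of being dropped or mangled.
    return escaped.encode(out_encoding, errors="xmlcharrefreplace")
```

With this sketch, html_title(b"It\x90s", "latin-1") yields a numeric
character reference for U+2019 rather than a raw 0x90 byte, which is the
kind of behaviour the patch aims for in the <title> tag.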
--
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.