[Poppler-bugs] [Bug 46521] New: pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in violation of RFC 3629.
bugzilla-daemon at freedesktop.org
Thu Feb 23 06:26:20 PST 2012
https://bugs.freedesktop.org/show_bug.cgi?id=46521
Bug #: 46521
Summary: pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in
violation of RFC 3629.
Classification: Unclassified
Product: poppler
Version: unspecified
Platform: x86-64 (AMD64)
OS/Version: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: pdftohtml
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: grubba at grubba.org
The output from pdftohtml for one of our PDFs contained the byte sequences:
\355\240\265\355\261\203 and \355\240\265\355\261\210
These are the UTF-16 surrogate pairs D835 DC43 and D835 DC48, with each
surrogate encoded separately as UTF-8. They stand for U+1D443 and U+1D448
(MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). The proper
UTF-8 encoding for these characters is \360\235\221\203 and \360\235\221\210.
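
The mapping from the byte sequences above to the proper encoding can be
checked with a short Python sketch. The function names combine_surrogates and
repair_surrogate_utf8 are hypothetical helpers for illustration, not part of
poppler; the sketch assumes every surrogate in the input is correctly paired.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def repair_surrogate_utf8(data: bytes) -> bytes:
    """Re-encode UTF-8 that contains separately encoded surrogates.

    Assumes all surrogates are properly paired; an unpaired surrogate
    makes the strict UTF-16 decode below raise UnicodeDecodeError.
    """
    # "surrogatepass" lets the decoder accept ED A0..BF xx sequences
    # and yields lone surrogate code points.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each pair into one code point.
    text = text.encode("utf-16", errors="surrogatepass").decode("utf-16")
    # The default (strict) encoder now emits 4-byte sequences.
    return text.encode("utf-8")


bad = b"\355\240\265\355\261\203"      # ED A0 B5 ED B1 83
good = repair_surrogate_utf8(bad)      # F0 9D 91 83
print(hex(combine_surrogates(0xD835, 0xDC43)), good)
```

The octal escapes in the bytes literal are the same ones quoted in the report,
so the input can be pasted in unchanged.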
Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8
encoded surrogates. From RFC 3629 section 3:
The definition of UTF-8 prohibits encoding character numbers between U+D800
and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
surrogate pairs) and do not directly represent characters. When encoding in
UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to
obtain character numbers, which are then encoded in UTF-8 as described above.
and:
Implementations of the decoding algorithm above MUST protect against decoding
invalid sequences. For instance, a naive implementation may decode the overlong
UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C
ED BE B4 into U+233B4. Decoding invalid sequences may have security
consequences or cause other problems. See Security Considerations (Section 10)
below.
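
As an illustration of the strict behaviour the RFC asks for, here is a minimal
Python sketch. Python 3's default UTF-8 decoder rejects both examples from the
quoted text. The helpers is_valid_utf8 and find_encoded_surrogates are
hypothetical names for this sketch, not poppler functions.

```python
import re


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# In valid UTF-8 the lead byte ED is only followed by 80..9F, so ED
# followed by A0..BF always marks an encoded surrogate (U+D800..U+DFFF).
_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_encoded_surrogates(data: bytes) -> list[int]:
    """Return the byte offsets of UTF-8 encoded surrogates in data."""
    return [m.start() for m in _SURROGATE.finditer(data)]


print(is_valid_utf8(b"\xc0\x80"))                  # overlong U+0000
print(is_valid_utf8(b"\xed\xa1\x8c\xed\xbe\xb4"))  # surrogate pair
print(find_encoded_surrogates(b"ab\xed\xa0\xb5\xed\xb1\x83"))
```

A check like find_encoded_surrogates could be run over pdftohtml output to
locate the offending bytes before handing the file to a strict decoder.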