[Poppler-bugs] [Bug 46521] New: pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in violation of RFC 3629.

bugzilla-daemon at freedesktop.org
Thu Feb 23 06:26:20 PST 2012


https://bugs.freedesktop.org/show_bug.cgi?id=46521

             Bug #: 46521
           Summary: pdftohtml outputs UTF-8 encoded surrogates (UTF-16) in
                    violation of RFC 3629.
    Classification: Unclassified
           Product: poppler
           Version: unspecified
          Platform: x86-64 (AMD64)
        OS/Version: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: pdftohtml
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: grubba at grubba.org


The output from pdftohtml for one of our PDFs contained the byte sequences:

  \355\240\265\355\261\203 and \355\240\265\355\261\210

They appear to be UTF-8 encoded UTF-16 surrogate pairs for U+1D443 and U+1D448
(MATHEMATICAL ITALIC CAPITAL P and MATHEMATICAL ITALIC CAPITAL U). The proper
UTF-8 encoding for these characters is \360\235\221\203 and \360\235\221\210.
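
The correspondence above can be checked by hand. The following Python sketch
is illustrative only and is not taken from poppler; the helper names
(decode_3byte, combine_surrogates, encode_4byte) are made up for this example.
It decodes the two reported 3-byte sequences without any surrogate check,
combines the resulting UTF-16 surrogates into a code point, and re-encodes that
code point as a 4-byte UTF-8 sequence.

```python
def decode_3byte(seq):
    """Naively decode one 3-byte UTF-8 sequence (no surrogate check)."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def combine_surrogates(hi, lo):
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def encode_4byte(cp):
    """Encode a code point in U+10000..U+10FFFF as 4-byte UTF-8."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte range")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# First byte sequence from the report, written with the same octal escapes.
reported = b"\355\240\265\355\261\203"

hi = decode_3byte(reported[0:3])   # high surrogate
lo = decode_3byte(reported[3:6])   # low surrogate
cp = combine_surrogates(hi, lo)    # the intended character
proper = encode_4byte(cp)          # RFC 3629 compliant encoding

print(f"{hi:04X} {lo:04X} -> U+{cp:04X} -> {proper.hex(' ')}")
```

Applying the same steps to the second sequence, \355\240\265\355\261\210,
gives U+1D448 and the bytes F0 9D 91 88.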

Many UTF-8 decoders and validators follow RFC 3629 and will reject UTF-8
encoded surrogates. From RFC 3629 section 3:

  The definition of UTF-8 prohibits encoding character numbers between U+D800
  and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
  surrogate pairs) and do not directly represent characters. When encoding in
  UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to
  obtain character numbers, which are then encoded in UTF-8 as described above.

and:

  Implementations of the decoding algorithm above MUST protect against decoding
  invalid sequences. For instance, a naive implementation may decode the
  overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate
  pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have
  security consequences or cause other problems. See Security Considerations
  (Section 10) below.
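
For anyone wanting to check pdftohtml output for this problem, here is a rough
Python sketch. It is an assumption-laden illustration, not part of poppler: the
function names are hypothetical, and the byte pattern is only meaningful on
data that is otherwise UTF-8, where ED is always a lead byte and ED followed by
A0..BF can only be an encoded surrogate (U+D800..U+DFFF). The second helper
relies on Python's built-in strict UTF-8 decoder, which rejects both the
overlong sequence and the encoded surrogate pair quoted from the RFC.

```python
import re

# ED A0..BF 80..BF is the 3-byte form of a code point in U+D800..U+DFFF.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_encoded_surrogates(data):
    """Return the byte offsets of UTF-8 encoded surrogates in data."""
    return [m.start() for m in ENCODED_SURROGATE.finditer(data)]


def is_strict_utf8(data):
    """True if data decodes with Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sequence from this report, then the two examples from the RFC text.
reported = b"\355\240\265\355\261\203"
overlong_nul = bytes.fromhex("C0 80")
rfc_pair = bytes.fromhex("ED A1 8C ED BE B4")

for sample in (reported, overlong_nul, rfc_pair):
    print(sample.hex(" "), is_strict_utf8(sample), find_encoded_surrogates(sample))
```

The pattern scan reports where the offending sequences are, which the strict
decoder alone does not do for every occurrence; the strict decoder catches
other invalid input, such as overlong forms, that the pattern does not look for.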


