[Poppler-bugs] [Bug 103309] New: pdftotext: UTF-16 text without BOM not properly extracted

bugzilla-daemon at freedesktop.org
Tue Oct 17 10:22:13 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=103309

            Bug ID: 103309
           Summary: pdftotext: UTF-16 text without BOM not properly
                    extracted
           Product: poppler
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: utils
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: ralf.stubner at r-institute.com

Created attachment 134881
  --> https://bugs.freedesktop.org/attachment.cgi?id=134881&action=edit
Sample file

Running pdftotext on the attached sample file produces no usable text.
Inspecting the file with a hex editor shows that the text is stored as
UTF-16BE *without* a BOM. xpdf displays the file correctly.
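
To illustrate the kind of input involved, here is a minimal Python sketch of a
heuristic that guesses whether a byte string is UTF-16BE even though the FE FF
BOM is missing. This is not Poppler's code and not a proposed patch. The
function names, the "at least half of the high bytes are zero" threshold, and
the Latin-1 fallback are all assumptions made for illustration only.

```python
import codecs


def looks_like_utf16be(raw: bytes) -> bool:
    """Hypothetical heuristic: guess whether raw is BOM-less UTF-16BE.

    For mostly Latin text encoded as UTF-16BE, the high byte of each
    16-bit code unit (even offsets 0, 2, 4, ...) is 0x00.
    """
    if len(raw) < 2 or len(raw) % 2 != 0:
        return False
    high_bytes = raw[0::2]
    # Assumed threshold: at least half of the high bytes are zero.
    return high_bytes.count(0) * 2 >= len(high_bytes)


def decode_text(raw: bytes) -> str:
    """Decode raw bytes, tolerating UTF-16BE without a BOM."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Regular case: BOM present, strip it and decode.
        return raw[2:].decode("utf-16-be")
    if looks_like_utf16be(raw):
        # The case from this report: UTF-16BE, but no BOM.
        return raw.decode("utf-16-be")
    # Assumption: fall back to Latin-1 as a stand-in for a single-byte encoding.
    return raw.decode("latin-1")


if __name__ == "__main__":
    sample = "Hello".encode("utf-16-be")  # b"\x00H\x00e\x00l\x00l\x00o"
    print(decode_text(sample))
```

A heuristic like this can misfire on text that is mostly non-Latin, since the
high bytes are then rarely zero, so it is only meant to show what "UTF-16BE
without BOM" looks like at the byte level.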

Tested with version 0.48.0 (Debian Stable) and 0.57.0 (Debian Testing).
