[Poppler-bugs] [Bug 18213] New: utils/pdftotext outputs special characters regardless of output encoding

bugzilla-daemon at freedesktop.org
Fri Oct 24 13:54:58 PDT 2008


http://bugs.freedesktop.org/show_bug.cgi?id=18213

           Summary: utils/pdftotext outputs special characters regardless of
                    output encoding
           Product: poppler
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: foobatzen at gmx.net


Using poppler 0.8.5, 0.9.3 and 0.10, the pdftotext utility writes control
characters such as ^L (form feed) into the resulting text file.

To reproduce:

wget http://downloads.oreilly.com/make/08/pummer.pdf
pdftotext -enc UTF-8 pummer.pdf pummer.txt

Now pummer.txt contains the following text on line 13:

^LN MAKE VOLUME 06
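
The ^L above is caret notation for the form feed character (U+000C). As a
rough sketch for locating such characters in the output, assuming the file has
already been read as UTF-8 text (the function name find_control_chars is
hypothetical and not part of poppler):

```python
import unicodedata


def find_control_chars(text):
    """Return (line, column, codepoint) for each control character found.

    Tab and carriage return are ignored; newline is used as the line
    separator. str.split("\n") is used instead of str.splitlines() because
    splitlines() treats form feed itself as a line boundary.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            # Unicode category "Cc" covers the C0 and C1 control characters.
            if unicodedata.category(ch) == "Cc" and ch not in "\t\r":
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits
```

Calling it on the contents of pummer.txt should list the position reported
above, if the file matches this report.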

I'm reporting this bug because I'd like to feed the output of pdftotext to
Apache Solr so that Lucene can index the text. However, the Apache Solr XML
reader doesn't handle these control characters within a UTF-8 XML CDATA
section.
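
Form feed is not among the characters permitted by the XML 1.0 Char
production, which would explain the rejection. One possible workaround on the
consumer side, independent of any change in pdftotext, is to filter the text
before building the XML. This is a minimal sketch; the name strip_xml_illegal
and the replacement parameter are hypothetical:

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text, replacement=""):
    """Replace characters that XML 1.0 does not allow (form feed included)."""
    return _XML_ILLEGAL.sub(replacement, text)
```

Passing replacement="\n" instead of the empty string would keep a visible
break wherever a form feed appeared.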

thanks!
ben

