[Poppler-bugs] [Bug 18213] New: utils/pdftotext outputs special characters regardless of output encoding
bugzilla-daemon at freedesktop.org
Fri Oct 24 13:54:58 PDT 2008
http://bugs.freedesktop.org/show_bug.cgi?id=18213
Summary: utils/pdftotext outputs special characters regardless of
output encoding
Product: poppler
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: medium
Component: general
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: foobatzen at gmx.net
Using poppler 0.8.5, 0.9.3 and 0.10, the pdftotext utility writes control
characters such as ^L (form feed, U+000C) into the resulting text file,
whichever output encoding is selected.
To reproduce:
wget http://downloads.oreilly.com/make/08/pummer.pdf
pdftotext -enc UTF-8 pummer.pdf pummer.txt
Now pummer.txt contains the following text on line 13:
^LN MAKE VOLUME 06
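To confirm where such characters appear, the output can be scanned for
control characters. The following is a minimal Python sketch, not part of
poppler; the function name find_control_chars is made up for illustration,
and it assumes the file has already been read into a string:

```python
import unicodedata


def find_control_chars(text):
    """Return (line, column, codepoint) for each control character found.

    Tab and carriage return are ignored, and newline is used as the line
    separator. str.splitlines() is avoided on purpose because it treats
    form feed (U+000C) as a line boundary and would hide it.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            # Unicode category "Cc" covers the C0 and C1 control characters.
            if unicodedata.category(ch) == "Cc" and ch not in "\t\r":
                hits.append((lineno, col, "U+%04X" % ord(ch)))
    return hits
```

It could be called on the contents of pummer.txt, for example after reading
the file with open("pummer.txt", encoding="utf-8").read().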
I'm reporting this bug because I'd like to feed the output of pdftotext to
Apache Solr so that Lucene can index the text. However, the Apache Solr
XML reader doesn't handle these special characters inside a UTF-8 XML
CDATA section.
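As a possible workaround on the consumer side, the text can be filtered
before it is placed into the XML document. XML 1.0 allows only tab, line
feed and carriage return below U+0020, so form feed is not a legal
character there. A minimal Python sketch, assuming the pdftotext output is
available as a string; the name strip_xml_illegal is hypothetical and this
is not a Solr or poppler API:

```python
import re

# Matches everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text, replacement=""):
    """Remove (or replace) characters that XML 1.0 does not permit."""
    return _XML_ILLEGAL.sub(replacement, text)
```

Passing a replacement such as "\n" instead of the empty string would keep
some separation at the places where the form feeds were.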
thanks!
ben
--
Configure bugmail: http://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.