[poppler] garbled text from pdftohtml/pdftotext
igor.slepchin at gmail.com
Sun Mar 25 16:51:27 PDT 2012
Chances are the PDF you are working with isn't using a "standard"
character encoding in the embedded fonts. Check whether text search works
in Evince or Okular - my guess is it doesn't...
On 03/24/2012 10:24 AM, Ihar `Philips` Filipau wrote:
> Hi All!
> I have encountered another strange PDF document. When viewing it in
> graphical viewers like Okular/Evince/Reader/FoxIt - it looks totally
> fine. But when I extract the content using pdftotext or pdftohtml, the
> text is garbled.
> A little tinkering with the output showed that the ASCII characters
> appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18. I
> applied a simple script that adds 29 to each character and can now read
> most of the text (except for the German umlauts; some strange characters
> also appear at the beginning of some lines).
> I guess my question is: what should I fix in pdftohtml to make
> it print the text properly?
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have the funny names:
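
The shift-by-29 workaround described in the quoted message can be sketched
in Python as follows. This is only an illustration, not the poster's actual
script: the name `unshift`, the `LAYOUT_CHARS` set, and the choice to leave
spaces and line breaks untouched are assumptions. It does not fix the
underlying font encoding problem and will not recover the umlauts.

```python
SHIFT = 29  # offset reported in the message: '5' (0x35) came out as 0x18

# Assumption: these characters are layout inserted by the extraction tool
# itself, not glyph codes from the font, so they are left untouched.
LAYOUT_CHARS = {" ", "\n", "\r", "\f"}


def unshift(text: str, shift: int = SHIFT) -> str:
    """Add `shift` to each character code when the result is printable ASCII."""
    out = []
    for ch in text:
        code = ord(ch) + shift
        if ch not in LAYOUT_CHARS and 0x20 <= code <= 0x7E:
            out.append(chr(code))
        else:
            # Leave layout characters and anything that would fall outside
            # printable ASCII (e.g. the umlauts) as they are.
            out.append(ch)
    return "".join(out)
```

A genuinely shifted glyph code can collide with a layout character (for
example, an apostrophe shifted down by 29 lands on the newline code), so the
result will still need manual cleanup in places.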