[poppler] garbled text from pdftohtml/pdftotext
Ihar `Philips` Filipau
thephilips at gmail.com
Sat Mar 24 07:24:48 PDT 2012
I have encountered another strange PDF document. When viewing it in
graphical viewers like Okular/Evince/Reader/FoxIt, it looks totally
fine. But when I extract the content using pdftotext or pdftohtml, the
text is garbled.
A little tinkering with the output showed that the ASCII characters
appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18. I
applied a simple script that adds 29 to the characters and can now read
most of the text (except for the German umlauts; also, some strange
characters appear at the beginning of some lines).
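
For illustration, a minimal sketch of such a workaround in Python. This
is not the original script, which is not included in this message. The
function name unshift, the decision to leave newlines untouched, and
the restriction to characters that map back into printable ASCII are
all assumptions made for this example.

```python
OFFSET = 29  # shift observed above: '5' (0x35) came out as 0x18


def unshift(text: str, offset: int = OFFSET) -> str:
    """Add `offset` to each character code that maps back into printable ASCII.

    Assumptions (not from the original script): newlines are produced by
    the extraction tool itself and are left alone, and any character whose
    shifted value would fall outside 0x20..0x7E is passed through unchanged.
    """
    out = []
    for ch in text:
        shifted = ord(ch) + offset
        if ch != "\n" and 0x20 <= shifted <= 0x7E:
            out.append(chr(shifted))
        else:
            out.append(ch)
    return "".join(out)
```

A filter like this only hides the symptom. It cannot recover characters
outside the shifted ASCII range, such as the umlauts, and a shifted
apostrophe (0x27 - 29 = 0x0A) is indistinguishable from a real line
break.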
I gather my question would be: what should I fix in pdftohtml to make
it print text properly?
P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
"Type 1C" and have the funny names: