[poppler] garbled text from pdftohtml/pdftotext

Ihar `Philips` Filipau thephilips at gmail.com
Sat Mar 24 07:24:48 PDT 2012


Hi All!

I have encountered another strange PDF document. When viewed in
graphical viewers like Okular/Evince/Reader/FoxIt, it looks totally
fine.

But when I extract the content using pdftotext or pdftohtml, the
text is garbled.

A little tinkering with the output showed that the ASCII characters
appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18.
I have applied a simple script to add 29 to the characters and can now
read most of the text (except for the German umlauts; also, some
strange characters appear at the beginning of some lines).
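
For illustration, the workaround can be sketched roughly like this in
Python. This is not the exact script I used: the name unshift and the
set of characters left untouched are assumptions, based on the guess
that line breaks, form feeds and plain spaces are inserted by the
extractor itself rather than taken from the font.

```python
SHIFT = 29  # offset observed in the output: '5' (0x35) came out as 0x18

# Assumption: these characters are produced by the extractor itself
# (line/page breaks, word spacing), so they are left untouched.
KEEP = {"\n", "\r", "\f", " "}


def unshift(text, shift=SHIFT):
    """Add `shift` to every character that lands in printable ASCII."""
    out = []
    for ch in text:
        code = ord(ch) + shift
        if ch not in KEEP and 0x20 <= code <= 0x7E:
            out.append(chr(code))
        else:
            # Anything else (e.g. the umlauts) is passed through as-is.
            out.append(ch)
    return "".join(out)


print(unshift("\x18"))  # prints 5
```

Characters that do not land in printable ASCII after the shift are
passed through unchanged, which matches what I see: the umlauts stay
unreadable.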

I guess my question is: what should I fix in pdftohtml to make
it print the text properly?


P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
"Type 1C" and have funny names:
IKFZYK+MSTT31c39b00
ILOQIT+MSTT31c38e00
MBQOWW+MSTT31c38100

