[poppler] garbled text from pdftohtml/pdftotext

Sun Mar 25 16:51:27 PDT 2012

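Chances are the PDF you are working with isn't using a "standard"
character encoding in its embedded fonts. A viewer only needs the glyph
shapes to draw the page, but text extraction also needs a mapping from
character codes to Unicode. Check whether text search works in Evince or
Okular; my guess is that it doesn't.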
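
One way to check this from the command line is poppler's pdffonts tool,
which lists each font together with a "uni" column saying whether the font
carries a ToUnicode map. The helper below is only a rough sketch: the name
fonts_without_tounicode is made up for illustration, and it assumes each
data row ends with the emb/sub/uni flags followed by the two object ID
numbers, and that font names contain no spaces.

```python
def fonts_without_tounicode(pdffonts_output: str) -> list[str]:
    """Return the names of fonts whose 'uni' column reads 'no'.

    Assumption: every data row ends with
        <emb> <sub> <uni> <object number> <generation>
    so the flags are read from the right-hand side. That sidesteps font
    types that contain spaces, such as "Type 1C".
    """
    missing = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or truncated line
        emb, sub, uni = fields[-5:-2]
        # The header and the dashed separator fail this yes/no check.
        if {emb, sub, uni} <= {"yes", "no"} and uni == "no":
            missing.append(fields[0])
    return missing
```

Feed it the text printed by "pdffonts yourfile.pdf". A font showing up here
is not proof of a problem, since pdftotext can sometimes fall back on the
font's encoding or glyph names, but a subset font with a custom encoding
and no ToUnicode map is a likely suspect.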

On 03/24/2012 10:24 AM, Ihar `Philips` Filipau wrote:
> Hi All!
>
> I have encountered another strange PDF document. When viewed in
> graphical viewers like Okular/Evince/Reader/Foxit, it looks totally
> fine.
>
> But when I extract the content using pdftotext or pdftohtml, the
> text is garbled.
>
> A little tinkering with the output showed that the ASCII characters
> appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18.
> I applied a simple script that adds 29 to the characters and can now
> read most of the text (except for the German umlauts; also, some strange
> characters appear at the beginning of some lines).
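
For reference, the workaround described above could look roughly like the
sketch below. It is not the original poster's script: the function name
unshift and the set of bytes left untouched are assumptions. Line and page
breaks are kept as they are, and so is any byte that would leave the
printable ASCII range after shifting (which covers the multi-byte UTF-8
sequences used for umlauts).

```python
SHIFT = 29  # offset reported in the message: '5' (0x35) came out as 0x18

# Assumption: these bytes are genuine layout characters written by the
# extractor (LF, form feed, CR), not shifted glyph codes.
KEEP = {0x0A, 0x0C, 0x0D}


def unshift(data: bytes, shift: int = SHIFT) -> bytes:
    """Add `shift` to every byte that lands in ASCII up to '~' (0x7E)."""
    out = bytearray()
    for b in data:
        if b in KEEP or b + shift > 0x7E:
            out.append(b)  # leave layout bytes and non-ASCII data alone
        else:
            out.append(b + shift)
    return bytes(out)
```

The mapping is ambiguous for a few codes: a shifted apostrophe (0x27 - 29
= 0x0A) is indistinguishable from a real newline, which may be one source
of the stray characters. A proper fix has to come from the font's own
encoding rather than from a fixed offset.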
>
> I guess my question is: what should I fix in pdftohtml to make
> it print the text properly?
>
>
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have funny names:
> IKFZYK+MSTT31c39b00
> ILOQIT+MSTT31c38e00
> MBQOWW+MSTT31c38100