[poppler] garbled text from pdftohtml/pdftotext
suzuki toshiya
mpsuzuki at hiroshima-u.ac.jp
Sat Mar 24 07:32:04 PDT 2012
Hi,
Could you let me know where I could download a sample PDF?
Ihar `Philips` Filipau wrote:
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have the funny names:
> IKFZYK+MSTT31c39b00
> ILOQIT+MSTT31c38e00
> MBQOWW+MSTT31c38100
Please do not call them "funny" :-). Such names are quite common
in PDFs where TrueType fonts were converted to PostScript Type 1
fonts during embedding. I'm afraid the PDFs were generated without
any consideration for text extraction and, if my guess is right,
even Adobe products cannot extract the text.
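
For reference, the six uppercase letters and the '+' in front of each
name are the subset tag that PDF puts on embedded font subsets; the part
after the '+' is the font name chosen by the generator. A minimal sketch
in Python that splits the two (the helper name split_subset_tag is made
up for this illustration):

```python
import re

# A subset tag is six uppercase letters followed by '+'.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None if the name has no subset tag."""
    match = _SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Font names taken from the quoted message.
    for name in ("IKFZYK+MSTT31c39b00",
                 "ILOQIT+MSTT31c38e00",
                 "MBQOWW+MSTT31c38100"):
        print(split_subset_tag(name))
```

The tag only marks the font as a subset; it says nothing about how
character codes map to Unicode, so it does not help text extraction.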
Regards,
mpsuzuki
Ihar `Philips` Filipau wrote:
> Hi All!
>
> I have encountered another strange PDF document. When viewing it in
> graphical viewers like Okular/Evince/Reader/FoxIt - it looks totally
> fine.
>
> But when I extract the content using the pdftotext or pdftohtml, the
> text is garbled.
>
> A little tinkering with the output showed that the ASCII characters
> appear to have been shifted down by 29. E.g. '5' (0x35) became 0x18.
> I applied a simple script that adds 29 to the characters and can now
> read most of the text (except for the German umlauts; also, some
> strange characters appear at the beginning of some lines).
>
> I gather my question would be: what should I fix in pdftohtml to make
> it print text properly?
>
>
> P.S. Okular (KDE 4.7.4) also showed that the embedded subset fonts are
> "Type 1C" and have the funny names:
> IKFZYK+MSTT31c39b00
> ILOQIT+MSTT31c38e00
> MBQOWW+MSTT31c38100
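
The workaround described in the quoted message (adding 29 to each
character code) could look roughly like the Python sketch below. It is
not the original poster's script, which was not posted. The function
name unshift is made up, and two things are assumptions: that line
breaks, form feeds and spaces come from pdftotext itself and should be
left alone, and that only results landing in printable ASCII
(0x20..0x7E) should be shifted.

```python
SHIFT = 29        # offset reported in the message: '5' (0x35) came out as 0x18
KEEP = "\n\r\f "  # assumption: whitespace emitted by pdftotext, not shifted


def unshift(text, shift=SHIFT, keep=KEEP):
    """Add `shift` to each character code that maps into printable ASCII."""
    out = []
    for ch in text:
        code = ord(ch) + shift
        if ch not in keep and 0x20 <= code <= 0x7E:
            out.append(chr(code))
        else:
            # Leave everything else as it is (whitespace, non-ASCII, ...).
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(unshift("\x18"))  # prints 5
```

As the quoted message notes, a fixed offset like this does not recover
the German umlauts, so it is at best a partial workaround.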