[poppler] PDF files with embedded Chinese fonts
Robert Hawkins
rh2305 at columbia.edu
Mon Feb 9 07:12:38 PST 2009
Hi, I saw Ross' note about not being able to extract Chinese characters from
certain PDFs and just wanted to mention that I've seen the same.
Unfortunately I am unable to share the PDFs, and from Ross' note I'm not
quite sure how to check if it's the same problem. But I can mention that I
have seen this problem with other languages, even English sometimes, too.
Most frequently I've seen poor text extraction from PDFs in Thai, though
some Thai PDFs do work. I thought the problem might be a missing CMap file,
but from your description it sounds like that might not be the case. Is that
correct?
I have also seen some Arabic text that I have not been able to interpret
correctly. Arabic is written right to left, but when I open the XML from
pdftohtml, the characters are reversed. That is, instead of 1234567 it
looks like 7654321. Also, even after reversing the characters, I haven't
quite been able to match them up with the text as it appears in the PDF.
Has anyone else seen this? Or does anyone have a clue as to what I might be
doing wrong?
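
For reference, here is a minimal Python sketch of the post-processing I have in mind for the Arabic case. It rests on two assumptions that I have not confirmed against these PDFs: that the extracted run is in visual (left-to-right on screen) order, and that the remaining mismatch comes from the text being stored as Arabic presentation forms (the positional glyph variants in the U+FE70 to U+FEFF range) rather than the base letters. The function name visual_to_logical is only illustrative and is not part of poppler.

```python
import unicodedata


def visual_to_logical(run: str) -> str:
    """Turn a visually ordered right-to-left run into logical order.

    Assumption: the whole run is right-to-left text. Embedded numbers or
    Latin words would be reversed too, so mixed runs need a real bidi
    algorithm rather than this simple reversal.
    """
    # Step 1: undo the visual ordering (7654321 -> 1234567).
    logical = run[::-1]
    # Step 2: fold presentation forms (initial/medial/final/isolated glyph
    # variants) back to the base Arabic letters. NFKC applies the Unicode
    # compatibility decompositions that define these mappings. Reversing
    # first matters: a ligature is one code point and expands in logical order.
    return unicodedata.normalize("NFKC", logical)


if __name__ == "__main__":
    # The word "kitab" as it might come out of extraction: visual order,
    # encoded with presentation forms (beh, alef, teh, kaf).
    extracted = "\uFE8F\uFE8E\uFE98\uFEDB"
    fixed = visual_to_logical(extracted)
    print([f"U+{ord(ch):04X}" for ch in fixed])
```

If the output still does not match the text shown in the PDF after both steps, the cause is probably something else, such as a font whose character codes have no usable mapping to Unicode.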