[poppler] PDF files with embedded Chinese fonts

Mon Feb 9 07:12:38 PST 2009

Hi, I saw Ross' note about not being able to extract Chinese characters from
certain PDFs and just wanted to mention that I've seen the same.
Unfortunately I am unable to share the PDFs, and from Ross' note I'm not
quite sure how to check if it's the same problem.  I can say, though, that I
have seen this problem with other languages too, sometimes even English.
Most frequently I've seen poor text extraction from PDFs in Thai, though
some Thai PDFs do work.  I thought the problem might be a missing CMap file,
but from your description it sounds like that might not be the case.  Is that
correct?

I have also seen some Arabic text that I have not been able to interpret
correctly.  Arabic is written right to left, but when I open the XML from
pdftohtml, the characters are reversed.  That is, instead of 1234567 it
looks like 7654321.  Also, even after reversing the characters, I haven't
quite been able to match them up with the text as it appears in the PDF.
Has anyone else seen this, or does anyone have a clue as to what I might be
doing wrong?
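
One possible explanation, offered only as an assumption and not confirmed
anywhere in this thread: the extracted text may be in visual order and may use
Arabic presentation-form code points (U+FE70 to U+FEFF) instead of the base
letters.  A plain reversal would then still not match the text as typed.  The
minimal Python sketch below reverses a run and folds presentation forms with
NFKC normalization.  The function name visual_to_logical is hypothetical, and
the simple reversal is only valid for a run that is entirely right to left; runs
that mix in digits or Latin text need a real bidirectional algorithm.

```python
import unicodedata


def visual_to_logical(run: str) -> str:
    """Convert a purely right-to-left run from visual to logical order.

    Assumption: every character in `run` is right to left. Mixed runs
    (digits, Latin text) are not handled by a plain reversal.
    """
    # Visual order lists characters as they appear left to right on the
    # page, so reversing the code points restores reading order.
    reversed_run = run[::-1]
    # NFKC replaces Arabic presentation forms (initial, medial, final,
    # isolated glyph variants) with their base letters.
    return unicodedata.normalize("NFKC", reversed_run)


# Example: beh (final form), teh (medial form), kaf (initial form),
# in the left-to-right order they would appear on the page.
extracted = "\uFE90\uFE98\uFEDB"
print([f"U+{ord(c):04X}" for c in visual_to_logical(extracted)])
```

If the code points in the pdftohtml output do not fall in the presentation-form
range, this assumption does not apply and the cause lies elsewhere.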