Hi, I saw Ross' note about not being able to extract Chinese characters from certain PDFs and just wanted to mention that I've seen the same problem. Unfortunately I am unable to share the PDFs, and from Ross' note I'm not sure how to check whether it's the same problem. I have also seen this with other languages, sometimes even English. Most often I've seen poor text extraction from Thai PDFs, though some Thai PDFs do work. I thought the problem might be a missing CMap file, but from your description it sounds like that might not be the case. Is that correct?<br>
<br>I have also seen some Arabic text that I have not been able to interpret correctly. Arabic is written right to left, but when I open the XML from pdftohtml, the characters are reversed: instead of 1234567 I get 7654321. Even after reversing the characters, I haven't been able to match them up with the text as it appears in the PDF. Has anyone else seen this, or does anyone have a clue as to what I might be doing wrong?<br>
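To make the reversal step concrete, here is a minimal Python sketch of what I have been trying. It rests on a guess I have not confirmed for these PDFs: that the extracted characters are in visual (left-to-right) order and are Arabic presentation forms (the contextual glyph code points in U+FE70 to U+FEFF) rather than the base letters, which would explain why a plain reversal still does not match. The function name `visual_to_logical` and the sample string are made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata


def visual_to_logical(text: str) -> str:
    """Reverse a purely right-to-left run and fold presentation forms.

    Assumption: `text` is a single Arabic run in visual order. Mixed
    content (digits, Latin words, combining marks) would need a real
    bidirectional algorithm, not a plain reversal.
    """
    # Step 1: undo the visual ordering by reversing the code points.
    reversed_text = text[::-1]
    # Step 2: NFKC maps presentation forms (including lam-alef
    # ligatures) to their base letters. Doing this after the reversal
    # keeps a decomposed ligature in lam-then-alef order.
    return unicodedata.normalize("NFKC", reversed_text)


# Hypothetical sample: meem (isolated), lam-alef ligature (final),
# seen (initial), as they might appear left to right on the page.
visual = "\uFEE1\uFEFC\uFEB3"
logical = visual_to_logical(visual)
print([f"U+{ord(c):04X}" for c in logical])
```

If the output still does not match the text shown in the PDF after this, the cause is probably elsewhere (for example the font's own character mapping), and this sketch will not help.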