Hello,<div><br></div><div>We have been doing some work using Poppler's pdftotext tool with the -html option to extract text with bounding box coordinates from PDF files. Later on we match up these pieces of text and coordinates with versions of the PDF files converted to images.</div>
<div><br></div><div>We are working with multiple languages, but right now we are focusing on Arabic. We are having a couple of problems with the encoding of Arabic characters. Sometimes all of the Unicode code points are in the wrong order, and other times some characters have their code points backwards while others do not.</div>
<div><br></div><div>According to our Arabic-speaking annotators, the text appears correct when we render the images, but when the text from the pdftotext tool is matched against these renderings, we encounter the problems described above.</div>
<div><br></div><div>I imagine that these issues are related to how PDF files are encoded and have little to do with how pdftotext extracts the text from them. Does anyone have any suggestions for dealing with these issues? We can resolve some of them manually fairly quickly, but when the code points are in a seemingly random order, all we can do is retype them, which is very time-consuming given that we are hoping to process hundreds of PDF pages.</div>
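<div><br></div><div>To make the "backwards" case concrete, here is a minimal Python sketch of the kind of repair that could work when a run of Arabic characters comes out exactly reversed. It is illustrative only: the function names are hypothetical, it assumes the only damage is a simple reversal of each right-to-left run, and it does not handle combining marks or the seemingly random orderings.</div>

```python
import unicodedata
from itertools import groupby


def is_rtl(ch):
    # "R" and "AL" are the strong right-to-left bidi classes
    # (Hebrew letters and Arabic letters, respectively).
    return unicodedata.bidirectional(ch) in ("R", "AL")


def reverse_rtl_runs(text):
    # Hypothetical helper: reverse every maximal run of RTL characters,
    # leaving Latin text, digits, spaces and punctuation where they are.
    parts = []
    for rtl, group in groupby(text, key=is_rtl):
        chunk = "".join(group)
        parts.append(chunk[::-1] if rtl else chunk)
    return "".join(parts)


# An Arabic word in logical (reading) order, written with escapes.
logical = "\u0633\u0644\u0627\u0645"
# Simulate an extraction that returned the code points backwards.
backwards = logical[::-1]

repaired = reverse_rtl_runs("abc " + backwards + " 123")
print(repaired == "abc " + logical + " 123")
```

<div>A per-run reversal like this leaves the order of the words themselves untouched, so a line whose words are also in reverse order would need a further step.</div>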
<div><br></div><div>Could someone also point me to where in the Poppler code the characters are extracted from the PDF file? We don't really know whether the issues are caused by how we are using pdftotext, whether something in the code could be improved, or whether there is simply nothing that can be done. We have done some research and found that Apache's PDFBox can correct some of the issues we have been facing, but we are still investigating its code to see what it does to fix the problems.</div>
<div><br></div><div>Thank you very much for your help!</div><div><br></div><div>Michael Younkin</div>