Michael Younkin younkin.michael at gmail.com
Wed Oct 31 10:28:33 PDT 2012


We have been doing some work using Poppler's pdftotext tool with the -html
option to extract text with bounding box coordinates from PDF files. Later
on we match up these pieces of text and coordinates with versions of the
PDF files converted to images.

We are working with multiple languages, but right now we are focusing on
Arabic. We are having a couple of problems with the order of the Unicode
code points in the extracted Arabic text. Sometimes all of the code points
are in the wrong order, and other times some characters have their code
points backwards and some do not.

According to our Arabic-speaking annotators, the text appears correct when
we render the PDF pages as images, but when the text from the pdftotext
tool is matched against these renderings, we encounter the problems I
described above.

I imagine that these issues are related to how PDF files are encoded and
have little to do with how pdftotext extracts the text from them. Does
anyone have any suggestions for dealing with these issues? We can resolve
some of them manually pretty quickly, but when the code points are in a
seemingly random order all we can do is retype them, which is very
time-consuming, as we are hoping to process hundreds of PDF pages.
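
To make the two symptoms concrete, here is a rough Python sketch of how an
extracted string could be triaged against a reference string typed by an
annotator. It assumes that "wrong order" means reversed (visual rather than
logical order), which may not hold for every file. The names
reverse_rtl_runs and classify are made up for this example and are not part
of Poppler or PDFBox. It is not a full bidi implementation: it ignores
combining marks, digits, and Arabic runs separated by spaces.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def reverse_rtl_runs(text):
    """Reverse each maximal run of strong RTL characters; leave the rest as-is."""
    out = []
    run = []
    for ch in text:
        if is_rtl(ch):
            run.append(ch)
        else:
            out.extend(reversed(run))  # flush the pending RTL run, reversed
            run = []
            out.append(ch)
    out.extend(reversed(run))  # flush a run that reaches the end of the text
    return "".join(out)


def classify(extracted, expected):
    """Guess how an extracted string differs from the annotator's reference."""
    if extracted == expected:
        return "match"
    if extracted[::-1] == expected:
        return "fully reversed"
    if reverse_rtl_runs(extracted) == expected:
        return "rtl runs reversed"
    return "other"
```

Strings that come back as "other" would still need to be retyped by hand,
but the first two mismatch cases could in principle be corrected
automatically.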

Could someone also point me to where in the Poppler code characters get
extracted from the PDF file? We don't really know whether the issues are
caused by how we are using pdftotext, whether something could be improved
in the code, or whether there is simply nothing that can be done. We have
done some research and found that Apache's PDFBox can correct some of the
issues we have been facing, but we are still investigating its code to see
what it does to fix the problems.

Thank you very much for your help!

Michael Younkin
