[poppler] Source Code Roadmap/Arabic Text Decoding
adamreichold at myopera.com
Wed Oct 31 10:38:29 PDT 2012
I am unsure whether I completely understand your problem, but if this
is about the mapping from logical to visual text order, the bug and
patch at  might interest you. (And maybe also .) But as I said,
I am not really sure whether you mean that the extracted text has
encoding or ordering problems.
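
One way to tell the two cases apart is to dump the code points of a short extracted string and compare them with the same word typed in logical order. The following is only a minimal sketch in Python using the standard unicodedata module. It is not part of poppler, and the helper names describe and looks_visually_ordered are made up for illustration. It assumes you have a known-good logical-order reference string to compare against.

```python
import unicodedata

def describe(text):
    """Return (code point, bidi class, character name) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.bidirectional(c), unicodedata.name(c, "<unnamed>"))
        for c in text
    ]

def looks_visually_ordered(extracted, expected_logical):
    """True if `extracted` is exactly the reversal of the logical-order reference.

    This only detects the simplest case (a pure right-to-left run stored in
    visual order); mixed-direction text needs the full Unicode bidi algorithm.
    """
    return extracted != expected_logical and extracted[::-1] == expected_logical

if __name__ == "__main__":
    logical = "\u0633\u0644\u0627\u0645"   # reference word, typed in logical order
    extracted = logical[::-1]              # stand-in for a string taken from pdftotext
    for row in describe(extracted):
        print(*row)
    print("visual order?", looks_visually_ordered(extracted, logical))
```

If the code points are the right ones but reversed, it is an ordering problem. If the code points themselves differ from the reference, it is more likely an encoding problem in the PDF's fonts.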
Best regards, Adam.
On 31.10.2012 18:28, Michael Younkin wrote:
> We have been doing some work using Poppler's pdftotext tool with
> the -html option to extract text with bounding box coordinates from
> PDF files. Later on we match up these pieces of text and
> coordinates with versions of the PDF files converted to images.
> We are working with multiple languages, but right now we are
> focusing on Arabic. We are having a couple of problems with the
> encodings of Arabic characters. Sometimes all of the Unicode code
> points will be in the wrong order, and other times some characters
> have their code points backwards and some do not.
> According to our Arabic-speaking annotators, when we render the
> images the text appears correct, but when text from the pdftotext
> tool is matched against these renderings, we encounter the problems
> I stated above.
> I imagine that these issues are related to how PDF files are
> encoded and have little to do with how pdftotext is extracting the text
> from PDF files. Does anyone have any suggestions for dealing with
> these issues? We can resolve some of them manually pretty quickly,
> but sometimes when the code points are in a seemingly random order
> all we can do is retype them, which is very time consuming as we
> are hoping to process hundreds of PDF file pages.
> Could someone also point me to where in the poppler code characters
> get extracted from the PDF file? We don't really know if it is just
> how we are using pdftotext that is causing the issues, if there is
> something that could be improved in the code, or if there is simply
> nothing that can be done. We have done some research and found that
> Apache's PDFBox can correct some of the issues we have been facing,
> but we are still investigating the code to see what they are doing
> to fix the problems.
> Thank you very much for your help!
> Michael Younkin