[poppler] Source Code Roadmap/Arabic Text Decoding

Wed Oct 31 10:38:29 PDT 2012

Hello Michael,

I am not sure I completely understand your problem, but if it concerns
the mapping from logical to visual text order, the bug and patch at [1]
might interest you (and maybe also [2]). As I said, I am not sure
whether you mean that the extracted text has encoding problems or
ordering problems.
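
To illustrate what "visual" versus "logical" order means here, below is
a minimal Python sketch. It is not taken from poppler or from the
patches at [1] and [2]; the function name visual_to_logical is
hypothetical, and the approach is a naive heuristic rather than the
full Unicode Bidirectional Algorithm. It assumes that a whole line was
extracted in left-to-right glyph (visual) order and that the paragraph
direction is right-to-left.

```python
import unicodedata

# Bidi classes from the Unicode Character Database.
RTL_CLASSES = {"R", "AL"}        # Hebrew-like and Arabic letters
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters, European and Arabic digits


def contains_rtl(text):
    """Return True if any character has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def visual_to_logical(line):
    """Naively convert a visually ordered RTL line to logical order.

    Assumption: the entire line is stored in visual order. The line is
    reversed as a whole, then runs of left-to-right characters (Latin
    text, numbers) are reversed back so they still read left to right.
    """
    if not contains_rtl(line):
        return line  # nothing to reorder in a purely LTR line

    out = []
    ltr_run = []
    for ch in reversed(line):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            ltr_run.append(ch)
        else:
            out.extend(reversed(ltr_run))  # flush the restored LTR run
            ltr_run = []
            out.append(ch)
    out.extend(reversed(ltr_run))
    return "".join(out)


# "salam" in logical order, written with escapes to avoid display issues.
logical = "\u0633\u0644\u0627\u0645"
visual = logical[::-1]
print(visual_to_logical(visual) == logical)
```

A heuristic like this only helps when an entire line is reversed. It
cannot repair text in which only some characters are out of order.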

Best regards, Adam.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=55977

[2] https://bugs.freedesktop.org/show_bug.cgi?id=2981

On 31.10.2012 18:28, Michael Younkin wrote:
> Hello,
> 
> We have been doing some work using Poppler's pdftotext tool with
> the -html option to extract text with bounding box coordinates from
> PDF files. Later on we match up these pieces of text and
> coordinates with versions of the PDF files converted to images.
> 
> We are working with multiple languages, but right now we are
> focusing on Arabic. We are having a couple of problems with the
> encodings of Arabic characters. Sometimes all of the Unicode code
> points will be in the wrong order, and other times some characters
> have their code points backwards and some do not.
> 
> According to our Arabic-speaking annotators, when we render the
> images the text appears correct, but when text from the pdftotext
> tool is matched against these renderings, we encounter the problems
> I stated above.
> 
> I imagine that these issues are related to how the PDF files are
> encoded and have little to do with how pdftotext extracts the text
> from them. Does anyone have any suggestions for dealing with
> these issues? We can resolve some of them manually pretty quickly,
> but sometimes when the code points are in a seemingly random order
> all we can do is retype them, which is very time-consuming as we
> are hoping to process hundreds of PDF pages.
> 
> Could someone also point me to where in the poppler code characters
> get extracted from the PDF file? We don't really know if it is just
> how we are using pdftotext that is causing the issues, if there is
> something that could be improved in the code, or if there is simply
> nothing that can be done. We have done some research and found that
> Apache's PDFBox can correct some of the issues we have been facing,
> but we are still investigating the code to see what they are doing
> to fix the problems.
> 
> Thank you very much for your help!
> 
> Michael Younkin