[poppler] extracting accent characters from PDFs

Wed Jun 11 17:11:42 PDT 2008

On 08/06/2008, at 10:29 AM, Ross Moore wrote:

> With the examples that I have tried, the best results are
> obtained using:   pdftotext -raw
>
> For example, on a slightly extended version of the PDF
> from my previous posting, using  -raw  gives (correctly):
>
>       für Löwen und Agnés
>
> whereas not using -raw  gives either:
>
>      fur Lowen und Agnes
>                        or
>      fur Lowen und Agnes

The bare accent characters were stripped from the lines above
in the email. Here is a different representation, with each
accent shown as hex bytes:

      fur Lowen und Agnes
       <CC><88> <CC><88>                 <CC><81>

                       or

     fur Lowen und Agnes <CC><88> <CC><88> <CC><81>
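
The byte pairs in that representation look like UTF-8 sequences for
combining marks. As a minimal sketch (Python is my choice here, and
nothing below comes from pdftotext itself), this decodes them and
checks which characters they are:

```python
import unicodedata

# The byte pairs from the representation above, read as UTF-8.
diaeresis = bytes([0xCC, 0x88]).decode("utf-8")
acute = bytes([0xCC, 0x81]).decode("utf-8")

for mark in (diaeresis, acute):
    # Print the code point and its official Unicode name.
    print(f"U+{ord(mark):04X}", unicodedata.name(mark))

# A combining mark placed after its base letter composes under NFC.
composed = unicodedata.normalize("NFC", "u" + diaeresis)
print(composed == "\u00fc")
```

Both decode to combining characters (U+0308 and U+0301), which is why
they only display correctly when they directly follow their base letter.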

>
> according to whether  -layout  is used, or not.
> ( -raw  seems to override  -layout  so there
> is no need to look at 4 separate cases.)

> There could be a switch to tell  pdftotext  to swap the order
> of the accent character and the letter; but this isn't sufficiently
> general to cope with all cases. For example, TeX has traditionally
> placed over-accents before the letter, but under-accents after it.
> And what about having multiple diacritic marks on the same letter?
>
> Also, the "dot under" and "underbar" accents are produced by
> placing the same character as used for "dot above" and "macron"
> diacritics, but positioned below the letter.
>
> Thus there are several issues that need to be handled to get the
> "correct" text extraction from such PDFs.

That is, both the layout and the original stream order
must be considered, perhaps also using extra knowledge
of how the PDF was generated.
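
To make the simplest case concrete, here is a rough sketch of the
"swap" idea as a post-processing step. It assumes the extracted text
has each combining mark immediately before its letter (the traditional
TeX over-accent order); the function name reattach_leading_marks is
hypothetical, and this is not something pdftotext provides.

```python
import unicodedata

def reattach_leading_marks(text: str) -> str:
    """Move combining marks after the next character, then compose (NFC).

    Assumption: every mark precedes its base letter in the input.
    """
    out = []
    pending = []  # marks seen but not yet attached to a base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)  # attach held marks after this character
            pending = []
    out.extend(pending)  # trailing marks with no base are kept as-is
    return unicodedata.normalize("NFC", "".join(out))

# Marks placed before the letters they belong to.
sample = "f\u0308ur L\u0308owen und Agn\u0301es"
print(reattach_leading_marks(sample))
```

This handles only the accent-before-letter ordering. Under-accents
placed after the letter, or accents that end up on a separate line,
would be attached to the wrong character, which is the lack of
generality described above.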

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------