[poppler] extracting accent characters from PDFs
Ross Moore
ross at ics.mq.edu.au
Wed Jun 11 17:11:42 PDT 2008
On 08/06/2008, at 10:29 AM, Ross Moore wrote:
> With the examples that I have tried, the best results are
> obtained using: pdftotext -raw
>
> For example, on a slightly extended version of the PDF
> from my previous posting, using -raw gives (correctly):
>
> für Löwen und Agnés
>
> whereas not using -raw gives either:
>
> fur Lowen und Agnes
> or
> fur Lowen und Agnes
The bare accent characters were stripped from the lines above
somewhere in the email. Here is a different representation, with
each combining accent shown as its UTF-8 bytes in hexadecimal:
fur Lowen und Agnes
<CC><88> <CC><88> <CC><81>
or
fur Lowen und Agnes <CC><88> <CC><88> <CC><81>
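For anyone wanting to check what those byte pairs are, here is a small Python sketch (not part of poppler; the function name describe_marks is just made up for illustration). It decodes the bytes as UTF-8 and reports the code point, the Unicode name and the canonical combining class of each character. A non-zero combining class marks a combining character.

```python
import unicodedata

def describe_marks(data: bytes):
    """Decode UTF-8 bytes and describe each resulting character.

    Returns a list of (code point, Unicode name, combining class) tuples.
    A combining class other than 0 means the character is a combining mark.
    """
    text = data.decode("utf-8")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# The two distinct byte pairs from the representation above.
for entry in describe_marks(bytes.fromhex("CC88CC81")):
    print(entry)
```

So <CC><88> is the combining diaeresis and <CC><81> is the combining acute accent.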
>
> according to whether -layout is used, or not.
> ( -raw seems to override -layout so there
> is no need to look at 4 separate cases.)
> There could be a switch to tell pdftotext to swap the order
> of the accent character and the letter; but this isn't sufficiently
> general to cope with all cases. For example, TeX has traditionally
> placed over-accents before the letter, but under-accents after it.
> And what about having multiple diacritic marks on the same letter?
>
> Also, the "dot under" and "underbar" accents are produced by
> placing the same character as used for "dot above" and "macron"
> diacritics, but positioned below the letter.
>
> Thus there are several issues that need to be handled to get the
> "correct" text extraction from such PDFs.
That is, both the layout and the original stream order
must be considered, perhaps also using extra knowledge
of how the PDF was generated.
Hope this helps,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------