[poppler] extracting accent characters from PDFs
ross at ics.mq.edu.au
Sat Jun 7 17:29:11 PDT 2008
Now that version 0.8.3 is out, I'd like to return to this
issue, which concerns the correct extraction of text containing
letters with diacritical marks from PDFs.
> On 29/05/2008, at 9:43 AM, Ross Moore wrote:
>> Hi all.
>> The attached PDF displays just a single word which includes
>> an accented character. It was created using (La)TeX as: f\"ur
>> Within the PDF it appears as a stream:
>> /F15 10.9091 Tf 108.737 686 Td [(fu)556(^?r)]TJ
For me the issue is not just extraction from newly created PDFs,
but also from those that exist in scientific archives, created
with tools that predate Unicode and UTF8. Hence the issues of
deciding: (a) which character to use to represent the accent;
and (b) where to place it within the output stream.
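To illustrate issue (a): Unicode allows two canonically equivalent
representations of an accented letter, either a single precomposed
code point or the base letter followed by a combining mark. A minimal
Python sketch of the distinction (my own illustration, not poppler code):

```python
import unicodedata

# Two equivalent ways to represent the word "für":
precomposed = "f\u00FCr"   # 'ü' as the single code point U+00FC
combining = "fu\u0308r"    # 'u' followed by COMBINING DIAERESIS U+0308

# Both render identically; normalization converts between them.
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

Which of the two forms an extractor emits matters less than consistency,
since a consumer can always normalize; the placement of the mark relative
to the base letter, however, must be right before normalization can help.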
With the examples that I have tried, the best results are
obtained using: pdftotext -raw
For example, on a slightly extended version of the PDF
from my previous posting, using -raw gives (correctly):
für Löwen und Agnés
whereas not using -raw gives either:
fur Lowen und Agnes
or:
fur Lowen und Agnes
according to whether -layout is used or not; the accents
do not come through correctly in either case.
( -raw seems to override -layout so there
is no need to look at 4 separate cases.)
This example uses "combining accent" characters U+0301 and U+0308
obtained using a /ToUnicode CMap resource; but the actual characters
in the PDF stream are at different locations, as seen here:
Yet the man page for pdftotext says that use of -raw is
discouraged. So my first question is: why is this?
What is the problem with -raw that makes it not recommended,
when it clearly has good aspects?
Some work is needed to obtain the best text-extraction algorithm
that works in more general situations.
The second problem is that existing documents have no CMap resource,
and produce a PDF stream such as the following:
where now the accent character occurs *before* the letter,
rather than after it.
Thus simply putting a combining accent at this position in the
output stream does not give the correct visual representation.
There could be a switch to tell pdftotext to swap the order
of the accent character and the letter; but this isn't sufficiently
general to cope with all cases. For example, TeX has traditionally
placed over-accents before the letter, but under-accents after it.
And what about having multiple diacritic marks on the same letter?
Also, the "dot under" and "underbar" accents are produced by
placing the same character as used for "dot above" and "macron"
diacritics, but positioned below the letter.
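A more general approach than a simple swap switch might be a lookup
keyed on both the accent glyph and its vertical position relative to
the base letter, always emitting the base letter first and the
combining mark after it. A rough Python sketch (the glyph names, the
ACCENT_MAP table, and the above/below flag are all hypothetical; in a
real extractor the position would be deduced from the accent glyph's
displacement in the content stream):

```python
import unicodedata

# Hypothetical table: (accent glyph, position) -> Unicode combining mark.
# The same glyph maps to different marks depending on placement.
ACCENT_MAP = {
    ("dotaccent", "above"): "\u0307",  # COMBINING DOT ABOVE
    ("dotaccent", "below"): "\u0323",  # COMBINING DOT BELOW
    ("macron",    "above"): "\u0304",  # COMBINING MACRON
    ("macron",    "below"): "\u0331",  # COMBINING MACRON BELOW
    ("dieresis",  "above"): "\u0308",  # COMBINING DIAERESIS
}

def attach(base, accents):
    """Emit the base letter first, then its combining marks, regardless
    of whether the accent glyphs preceded or followed the letter in the
    PDF stream; finally normalize to precomposed form where possible."""
    marks = "".join(ACCENT_MAP[a] for a in accents)
    return unicodedata.normalize("NFC", base + marks)

print(attach("u", [("dieresis", "above")]))   # ü (U+00FC)
print(attach("s", [("dotaccent", "below")]))  # ṣ (U+1E63)
```

Multiple diacritics on one letter simply become several entries in the
accents list; normalization then applies Unicode's canonical ordering
of the combining marks.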
Thus there are several issues that need to be handled to get the
"correct" text extraction from such PDFs.
I've prepared example PDFs for use in developing and testing
text-extraction. They can be found at:
The .txt files are the text extracted from the corresponding
.pdf files using pdftotext -raw.
Both TeX's OT1 and T1 font encodings have been used.
For each encoding there is a PDF with no CMap, and two PDFs
using different CMap resources. One maps to Unicode code-points,
while the other maps to ASCII strings giving the TeX macro for
each character or accent.
The files 5019-e-cmap.pdf and 5019-e-mmap.pdf have a full
mathematical paper, using OT1 encoding and CMaps of the types
described in the above paragraph.
I would appreciate help from anyone involved with developing
the text-extraction for poppler/Xpdf/pdftotext;
that is, (I think) the coding in:
BTW, the .txt files above could not have been produced without
the new features in poppler v0.8.3.
Thank you, to all involved in that release.
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114