[poppler] extracting accent characters from PDFs
ross at ics.mq.edu.au
Sat Jun 7 17:29:11 PDT 2008
Now that version 0.8.3 is out, I'd like to return to this
issue, which concerns the correct extraction of text containing
letters with diacritical marks from PDFs.
> On 29/05/2008, at 9:43 AM, Ross Moore wrote:
>> Hi all.
>> The attached PDF displays just a single word which includes
>> an accented character. It was created using (La)TeX as: f\"ur
>> Within the PDF it appears as a stream:
>> /F15 10.9091 Tf 108.737 686 Td [(fu)556(^?r)]TJ
For me the issue is not just extraction from newly created PDFs,
but also from those that exist in scientific archives, created
with tools that predate Unicode and UTF8. Hence the issues of
deciding: (a) which character to use to represent the accent;
and (b) where to place it within the output stream.
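To illustrate issue (a): Unicode allows two canonically equivalent
representations of an accented letter, either a single precomposed
code point or the base letter followed by a combining mark. A minimal
Python sketch of the distinction (my own illustration, not poppler code):

```python
import unicodedata

# Two equivalent ways to represent the word "für":
precomposed = "f\u00FCr"   # 'ü' as the single code point U+00FC
combining = "fu\u0308r"    # 'u' followed by COMBINING DIAERESIS U+0308

# Both render identically; normalization converts between them.
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

Which of the two forms an extractor emits matters less than consistency,
since a consumer can always normalize; the placement of the mark relative
to the base letter, however, must be right before normalization can help.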
With the examples that I have tried, the best results are
obtained using: pdftotext -raw
For example, on a slightly extended version of the PDF
from my previous posting, using -raw gives (correctly):
für Löwen und Agnés
whereas not using -raw gives either:
fur Lowen und Agnes
or:
fur Lowen und Agnes
according to whether -layout is used or not; the accents
do not come through correctly in either case.
( -raw seems to override -layout so there
is no need to look at 4 separate cases.)
This example uses "combining accent" characters U+0301 and U+0308
obtained using a /ToUnicode CMap resource; but the actual characters
in the PDF stream are at different locations, as seen here:
Yet the man page for pdftotext says that use of -raw is
discouraged. So my first question is: why is this?
What is the problem with -raw that makes it not recommended,
when it clearly has good aspects?
Some work is needed to obtain the best text-extraction algorithm
that works in more general situations.
The second problem is that existing documents have no CMap resource,
and produce a PDF stream such as the following:
where now the accent character occurs *before* the letter,
rather than after it.
Thus simply putting a combining accent at this position in the
output stream does not give the correct visual representation.
There could be a switch to tell pdftotext to swap the order
of the accent character and the letter; but this isn't sufficiently
general to cope with all cases. For example, TeX has traditionally
placed over-accents before the letter, but under-accents after it.
And what about having multiple diacritic marks on the same letter?
Also, the "dot under" and "underbar" accents are produced by
placing the same character as used for "dot above" and "macron"
diacritics, but positioned below the letter.
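A more general approach than a simple swap switch might be a lookup
keyed on both the accent glyph and its vertical position relative to
the base letter, always emitting the base letter first and the
combining mark after it. A rough Python sketch (the glyph names, the
ACCENT_MAP table, and the above/below flag are all hypothetical; in a
real extractor the position would be deduced from the accent glyph's
displacement in the content stream):

```python
import unicodedata

# Hypothetical table: (accent glyph, position) -> Unicode combining mark.
# The same glyph maps to different marks depending on placement.
ACCENT_MAP = {
    ("dotaccent", "above"): "\u0307",  # COMBINING DOT ABOVE
    ("dotaccent", "below"): "\u0323",  # COMBINING DOT BELOW
    ("macron",    "above"): "\u0304",  # COMBINING MACRON
    ("macron",    "below"): "\u0331",  # COMBINING MACRON BELOW
    ("dieresis",  "above"): "\u0308",  # COMBINING DIAERESIS
}

def attach(base, accents):
    """Emit the base letter first, then its combining marks, regardless
    of whether the accent glyphs preceded or followed the letter in the
    PDF stream; finally normalize to precomposed form where possible."""
    marks = "".join(ACCENT_MAP[a] for a in accents)
    return unicodedata.normalize("NFC", base + marks)

print(attach("u", [("dieresis", "above")]))   # ü (U+00FC)
print(attach("s", [("dotaccent", "below")]))  # ṣ (U+1E63)
```

Multiple diacritics on one letter simply become several entries in the
accents list; normalization then applies Unicode's canonical ordering
of the combining marks.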
Thus there are several issues that need to be handled to get the
"correct" text extraction from such PDFs.
I've prepared example PDFs for use in developing and testing
text-extraction. They can be found at:
The .txt files are the text extracted from the corresponding
.pdf files using pdftotext -raw.
Both TeX's OT1 and T1 font encodings have been used.
For each encoding there is a PDF with no CMap, and two PDFs
using different CMap resources. One maps to Unicode code-points,
while the other maps to ASCII strings giving the TeX macro for
each character or accent.
The files 5019-e-cmap.pdf and 5019-e-mmap.pdf have a full
mathematical paper, using OT1 encoding and CMaps of the types
described in the above paragraph.
I would appreciate help from anyone involved with developing
the text-extraction for poppler/Xpdf/pdftotext;
that is, (I think) the coding in:
BTW, the .txt files above could not have been produced without
the new features in poppler v0.8.3.
Thank you, to all involved in that release.
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114