[poppler] extracting accent characters from PDFs
Ross Moore
ross at ics.mq.edu.au
Wed May 28 16:43:50 PDT 2008
Hi all.
The attached PDF displays a single word that includes
an accented character. It was created using (La)TeX as: f\"ur
Within the PDF it appears as the following content stream:
stream
BT
/F15 10.9091 Tf 108.737 686 Td [(fu)556(^?r)]TJ
ET
endstream
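For anyone not familiar with the TJ operator: per the PDF specification, a number inside the array shifts the following glyph by that many thousandths of the font size, and a positive value moves it backwards in horizontal text. Below is a minimal Python sketch of that arithmetic. The function name tj_shift is made up for illustration, and character and word spacing are ignored. It suggests the accent glyph is drawn back over the "u", which may matter for any extraction that orders characters by position.

```python
def tj_shift(adjustment, font_size, horizontal_scale=1.0):
    """Horizontal displacement (text space units) caused by one TJ number.

    A positive adjustment moves the next glyph to the left.
    """
    return -(adjustment / 1000.0) * font_size * horizontal_scale


# Values taken from the stream above: /F15 10.9091 Tf ... [(fu)556(^?r)]TJ
shift = tj_shift(556, 10.9091)
print(f"accent is moved by {shift:.4f} units before it is drawn")
```

The result is negative, i.e. the pen moves back by a bit more than half the font size before the dieresis glyph is shown.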
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testaccents.pdf
Type: application/pdf
Size: 6436 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080529/16aedd72/attachment.pdf
-------------- next part --------------
The diaeresis accent is encoded as follows:
/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 127 /dieresis put
dup 102 /f put
dup 114 /r put
dup 117 /u put
readonly def
and has a corresponding CMap entry
<7F> <0308>
mapping to the "combining diaeresis" character.
The result of extracting text using pdftotext is interesting.
The output is "correct" when the -raw option is used:
rossmoor% pdftotext -layout -raw testaccents.pdf
rossmoor% more testaccents.txt
fu<CC><88>r
^L
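As a cross-check, the <CC><88> bytes shown by more are the UTF-8 encoding of U+0308, so the -raw output is the decomposed form of the word. A small Python sketch, using only the standard unicodedata module, shows this and how NFC normalization would turn it into the precomposed "ü". The variable names are mine, not anything from poppler.

```python
import unicodedata

# "fu" + COMBINING DIAERESIS + "r", as produced with -raw
decomposed = "fu\u0308r"
raw_bytes = decomposed.encode("utf-8")
print(raw_bytes)  # the accent appears as the two bytes CC 88

# Name of the code point that the CMap entry <7F> <0308> maps to
mark_name = unicodedata.name("\u0308")

# NFC normalization combines u + U+0308 into the single character U+00FC
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(decomposed), len(composed))
```

This only works because the mark directly follows its base letter, which is exactly what the -raw output preserves.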
... but the accent is misplaced with the default options, and also with -layout alone:
rossmoor% pdftotext testaccents.pdf
rossmoor% more testaccents.txt
fur <CC><88>
^L
rossmoor% pdftotext -layout testaccents.pdf
rossmoor% more testaccents.txt
fur
<CC><88>
^L
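To make the failure easy to spot in larger documents, here is a hedged Python sketch that flags combining marks which have lost their base letter, i.e. marks at the start of the text or directly after whitespace. The helper name detached_marks is hypothetical. It is only a detection aid for extracted text and not a description of what pdftotext does internally.

```python
import unicodedata


def detached_marks(text):
    """Return indices of combining marks that have no base character before them."""
    hits = []
    for i, ch in enumerate(text):
        # unicodedata.combining() is non-zero for combining marks such as U+0308
        if unicodedata.combining(ch):
            if i == 0 or text[i - 1].isspace():
                hits.append(i)
    return hits


raw_output = "fu\u0308r\n\x0c"        # pdftotext -layout -raw
default_output = "fur \u0308\n\x0c"   # pdftotext (default options)
layout_output = "fur\n\u0308\n\x0c"   # pdftotext -layout

for label, text in [("raw", raw_output),
                    ("default", default_output),
                    ("layout", layout_output)]:
    print(label, detached_marks(text))
```

Only the -raw output comes back with an empty list. In the other two the mark at index 4 follows a space or a newline.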
The man page says:
  -raw   Keep the text in content stream order. This is a hack which
         often "undoes" column formatting, etc. Use of raw mode is no
         longer recommended.
Yet it is precisely the use of -raw that handles this case correctly.
So my questions are:
Why is the accent wrongly placed by default?
What makes it move to after the containing word, or onto the next line?
If -raw is not recommended because of its bad effects in some situations,
what is the replacement for when it *is* appropriate?
Thanks in advance for any help in getting this fixed.
Regards,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------