[poppler] extracting accent characters from PDFs

Ross Moore ross at ics.mq.edu.au
Wed May 28 16:43:50 PDT 2008


Hi all.

The attached PDF displays just a single word which includes
an accented character. It was created using (La)TeX as:  f\"ur

Within the PDF it appears as a stream:

stream
BT
/F15 10.9091 Tf 108.737 686 Td [(fu)556(^?r)]TJ
ET
endstream
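
For reference, the number 556 inside the TJ array is a position adjustment, given in thousandths of a text-space unit; for horizontal text a positive value moves the next glyph to the left. Here is a minimal Python sketch of that arithmetic, assuming no extra character spacing and 100% horizontal scaling (the function name tj_shift is only illustrative):

```python
def tj_shift(adjustment, font_size, horizontal_scale=1.0):
    """Horizontal displacement caused by a numeric element of a TJ array.

    The adjustment is in thousandths of a text-space unit and is
    subtracted from the current position, so a positive value gives a
    negative (leftward) displacement.
    """
    return -(adjustment / 1000.0) * font_size * horizontal_scale


# Values taken from the stream above: /F15 10.9091 Tf ... [(fu)556(^?r)]TJ
shift = tj_shift(556, 10.9091)
print(round(shift, 4))
```

So the dieresis glyph is drawn roughly 6.07 units back from where "fu" ended, which presumably puts it over the "u" rather than after it.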


-------------- next part --------------
A non-text attachment was scrubbed...
Name: testaccents.pdf
Type: application/pdf
Size: 6436 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080529/16aedd72/attachment.pdf 
-------------- next part --------------


The diaeresis accent is encoded as follows:

/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 127 /dieresis put
dup 102 /f put
dup 114 /r put
dup 117 /u put
readonly def

and has a corresponding CMap entry

<7F> <0308>

mapping to the "combining diaeresis" character.
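
To make the expected result concrete, here is a small Python sketch that applies this mapping to the character codes in content-stream order. Only the <7F> to U+0308 entry comes from the CMap; that the ASCII letters map to themselves is an assumption, and the names TO_UNICODE and decode_codes are made up for illustration.

```python
import unicodedata

# Code-to-Unicode table. Only 0x7F -> U+0308 is taken from the CMap entry;
# every other code is assumed to map to the character with the same number.
TO_UNICODE = {0x7F: "\u0308"}


def decode_codes(codes):
    """Map each character code to Unicode, keeping stream order."""
    return "".join(TO_UNICODE.get(c, chr(c)) for c in codes)


# The two strings of the TJ array, (fu) and (^?r); the kerning number
# between them carries no text and is left out.
raw = b"fu" + b"\x7fr"
text = decode_codes(raw)

utf8 = text.encode("utf-8")                     # bytes a UTF-8 text file would hold
composed = unicodedata.normalize("NFC", text)   # u + U+0308 composed into one character

print(utf8)
print(composed)
```

In UTF-8 the combining diaeresis is the byte pair CC 88, directly after the "u", and NFC normalization turns the sequence into the single precomposed letter.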


The result of extracting text using  pdftotext  is interesting.

This is "correct", using the -raw  option:

rossmoor% pdftotext -layout -raw testaccents.pdf
rossmoor% more testaccents.txt
fu<CC><88>r
^L

... but it comes out wrong with the default options, and with -layout alone:

rossmoor% pdftotext testaccents.pdf
rossmoor% more testaccents.txt
fur <CC><88>

^L

rossmoor% pdftotext -layout testaccents.pdf
rossmoor% more testaccents.txt
fur
  <CC><88>
^L
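
One way to spot this kind of damage automatically is to look for combining marks that have no base letter immediately before them. The following is only a sketch of such a check in Python; the helper name stray_combining_marks is hypothetical and nothing like it is part of pdftotext.

```python
import unicodedata


def stray_combining_marks(text):
    """Return the indices of combining marks that follow whitespace
    (or start the text) instead of following a base character."""
    stray = []
    for i, ch in enumerate(text):
        if unicodedata.combining(ch):          # non-zero class: a combining mark
            prev = text[i - 1] if i > 0 else ""
            if prev == "" or prev.isspace():
                stray.append(i)
    return stray


# The three outputs shown above, with <CC><88> written as U+0308
# and ^L as a form feed.
raw_out = "fu\u0308r\n\f"
default_out = "fur \u0308\n\n\f"
layout_out = "fur\n  \u0308\n\f"

for name, out in [("raw", raw_out), ("default", default_out), ("layout", layout_out)]:
    print(name, stray_combining_marks(out))
```

Only the -raw output passes this check; in the other two the accent is reported as detached from its letter.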


The man page says:

  -raw   Keep the text in content stream order.  This is a hack which
         often "undoes" column formatting, etc.  Use of raw mode is no
         longer recommended.

Yet it is precisely use of  -raw  which gets this situation correct.

So my questions are:

    Why is the accent wrongly placed, by default?
    What makes it move to after the containing word, or onto the next line?

    If  -raw  is not recommended because of its bad effects in some
    situations, what replaces it in the cases where it *is* appropriate?


Thanks in advance for any help in getting this fixed.


Regards,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




