[Poppler-bugs] [Bug 66569] Wrong selection with umlauts

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Thu Nov 7 14:34:17 PST 2013


https://bugs.freedesktop.org/show_bug.cgi?id=66569

--- Comment #2 from kurt.pfeifle at gmail.com ---
This PDF uses fonts with a non-standard encoding that is built into the fonts
themselves.

This makes it rather tricky to convert the PDF to text, to extract or
copy'n'paste from it, or to get a screen reader to read the document aloud!
Not even Apple or Adobe get it completely right:

Pasting from Preview.app on a Mac, I get this:

  a ̈ o ̈ u ̈ ß • a ̈
  • o ̈ • u ̈ •ß
  1

Pasting from Adobe Acrobat Pro on a Mac, I get this:

  a o u 
   a
   o
   u
   
  1

Acrobat doesn't even display the bullet in front of the first item in the list
(in front of the 'ä')!

Using Poppler's pdftotext -layout achieves this:

  a¨
  ¨ ou
     ¨ß

  • ¨
    a
  • ¨
    o
  • u
    ¨
  • ß

So even these tools have problems pasting correctly from LaTeX-originated
PDFs!

The reason is this: very frequently, LaTeX uses "digraphs" to create composite
characters, *NOT* the real umlaut glyphs (named 'adieresis', 'udieresis' and
'odieresis' in PDF parlance) which are provided by non-LaTeX fonts.
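
How a text extractor could repair such pairs can be sketched in Python. This
is only a minimal sketch under stated assumptions, not what Poppler, Preview
or Acrobat actually do: it assumes the extractor has already mapped the
'dieresis' glyph to the spacing character U+00A8 and emits it directly before
its base letter. The helper name merge_dieresis is made up for illustration.

```python
import unicodedata

SPACING_DIERESIS = "\u00a8"    # assumed mapping of the 'dieresis' glyph
COMBINING_DIERESIS = "\u0308"  # Unicode combining diaeresis

def merge_dieresis(text):
    """Turn 'dieresis, base letter' pairs into precomposed characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == SPACING_DIERESIS and i + 1 < len(text) and text[i + 1].isalpha():
            # Unicode expects the combining mark after its base letter.
            out.append(text[i + 1] + COMBINING_DIERESIS)
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC normalization folds 'a' + U+0308 into the single character U+00E4.
    return unicodedata.normalize("NFC", "".join(out))

print(merge_dieresis("\u00a8a \u00a8o \u00a8u \u00df"))
```

A real extractor cannot rely on string order alone; it would have to look at
the glyph positions to decide which accent belongs to which letter.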

To show you this, I uncompressed the page content stream and saw this:

  5 0 obj
  <<
    /Length 570
  >>
  stream
  BT
  /F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ 8.302 -0.434 Td
  [<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td
  [(u)-333<19>]TJ/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ/F8 9.9626 Tf 9.963
  -0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ/F14 9.9626 Tf -9.963 -19.925 Td
  [<0f>]TJ/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ/F14 9.9626 Tf
  -9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ -0.277 0.434
  Td [(u)]TJ/F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ/F8 9.9626 Tf 9.963 0 Td
  [<19>]TJ 158.626 -486.177 Td [(1)]TJ
  ET
  endstream
  endobj

What you can see here is the frequent occurrence of...

 ... the <7f>, <19> and <0f> hex character codes,
 ... which translate to '\177', '\031' and '\017' in octal, and
 ... to 'DEL', 'EM' and 'SI' in ASCII.
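
The tally and the hex/octal/ASCII conversion can be reproduced with a short
Python sketch. It is not a PDF parser: it assumes single-byte hex strings and
font selections of the form "/F8 9.9626 Tf", exactly as they appear in the
first part of the stream quoted above. The function name tally_hex_codes is
made up for illustration.

```python
import re
from collections import Counter

# First part of the content stream quoted above (up to the first bullet).
STREAM = (
    "/F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ "
    "8.302 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td "
    "[<7f>]TJ -0.277 0.434 Td [(u)-333<19>]TJ/F14 9.9626 Tf "
    "-11.623 -21.918 Td [<0f>]TJ"
)

# ASCII names of the three control codes seen in the stream.
CONTROL_NAMES = {0x0F: "SI", 0x19: "EM", 0x7F: "DEL"}

def tally_hex_codes(stream):
    """Count hex-string character codes, keyed by (current font, code)."""
    counts = Counter()
    font = None
    # Matches either a font selection ("/F8 9.9626 Tf") or a hex string ("<7f>").
    token = re.compile(r"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f]{2})>")
    for m in token.finditer(stream):
        if m.group(1):
            font = m.group(1)
        else:
            counts[(font, int(m.group(2), 16))] += 1
    return counts

for (font, code), n in sorted(tally_hex_codes(STREAM).items()):
    name = CONTROL_NAMES.get(code, "?")
    print(f"{font}: <{code:02x}> = \\{code:03o} = {name}, seen {n} time(s)")
```

Keying the counts by font matters because the same byte value can stand for
different glyphs in F8 and F14.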

I suspect one of these codes represents the 'ß' in the built-in font encoding,
another the 'bullet', and the last the 'dieresis' used to construct the
umlauts. To investigate further, I used this command:

  grep -a CharSet poppler-bug#66569.pdf

It gave this output:

  /CharSet (/a/dieresis/germandbls/o/one/u)
  /CharSet (/bullet)

This confirms my suspicion: the embedded font 'CMR10' is subsetted to include
only the glyphs for

 * 'a'
 * 'o'
 * 'u'
 * 'dieresis'
 * 'germandbls' (German double s == ß)
 * 'one'  (the page number shown at the bottom of the page)

The other font, CMSY10, has only one glyph:

 * 'bullet'
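
The same check as the grep command can be sketched in Python, splitting each
/CharSet string into glyph names. Like grep, this only works when the font
descriptors are stored uncompressed in the file; the function name charsets
and the sample bytes (copied from the grep output above) are illustrative.

```python
import re

def charsets(pdf_bytes):
    """Return one list of glyph names per /CharSet entry found in the bytes."""
    found = []
    for match in re.finditer(rb"/CharSet\s*\(([^)]*)\)", pdf_bytes):
        names = match.group(1).decode("latin-1").split("/")
        # The string starts with '/', so drop the empty first element.
        found.append([name for name in names if name])
    return found

# Stand-in for open("file.pdf", "rb").read(), using the grep output above.
sample = b"/CharSet (/a/dieresis/germandbls/o/one/u)\n/CharSet (/bullet)\n"
for glyphs in charsets(sample):
    print(glyphs)
```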

LaTeX is good for preparing print- and read-ready PDF files. It is bad for
creating PDFs which you want to make accessible: people who need accessibility
features in their documents (e.g. to enable a screen reader) have the same
problems as people who want to copy'n'paste from the documents.

----

Poppler may have many problems with copy'n'pasting text from PDFs. This issue
is not one of them...
