[Poppler-bugs] [Bug 66569] Wrong selection with umlauts
bugzilla-daemon at freedesktop.org
Thu Nov 7 14:34:17 PST 2013
https://bugs.freedesktop.org/show_bug.cgi?id=66569
--- Comment #2 from kurt.pfeifle at gmail.com ---
This PDF uses fonts with a non-standard encoding that is built into the fonts
themselves. This makes it rather tricky to convert the PDF to text, to extract
or copy'n'paste from it, or to get a screen reader to read the document aloud!
Not even Apple or Adobe get it completely right:
Pasting from Preview.app on a Mac, I get this:
a ̈ o ̈ u ̈ ß • a ̈
• o ̈ • u ̈ •ß
1
Pasting from Adobe Acrobat Pro on a Mac, I get this:
a o u
a
o
u
1
Acrobat doesn't even display the bullet in front of the first item in the list
(in front of the 'ä')!
Running Poppler's pdftotext -layout produces this:
a¨
¨ ou
¨ß
• ¨
a
• ¨
o
• u
¨
• ß
So even these tools have trouble pasting text from LaTeX-originating PDFs
correctly!
The reason is this: very frequently, LaTeX uses "digraphs" (a base letter
overlaid with a separately drawn accent glyph) to create composite characters,
*NOT* the real umlaut glyphs (named 'adieresis', 'udieresis' and 'odieresis' in
PDF parlance) which are provided by non-LaTeX fonts.
To show this, I uncompressed the page content stream. It looks like this:
5 0 obj
<<
/Length 570
>>
stream
BT
/F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ 8.302 -0.434 Td
[<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td
[(u)-333<19>]TJ/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ/F8 9.9626 Tf 9.963
-0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ/F14 9.9626 Tf -9.963 -19.925 Td
[<0f>]TJ/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ/F14 9.9626 Tf
-9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ -0.277 0.434
Td [(u)]TJ/F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ/F8 9.9626 Tf 9.963 0 Td
[<19>]TJ 158.626 -486.177 Td [(1)]TJ
ET
endstream
endobj
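To make the pattern in this stream easier to see, here is a minimal Python sketch that lists the character codes shown by each TJ operator together with the font selected at that point. It is only an illustration: the function name shown_codes is made up, STREAM is an excerpt of the stream above, and the regular expressions assume the simple syntax seen here (no escaped parentheses in literal strings, even-length hex strings), so this is not a general PDF parser.

```python
import re

# Excerpt of the content stream quoted above (start of the first text line).
STREAM = (
    "/F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ "
    "8.302 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td "
    "[<7f>]TJ -0.277 0.434 Td [(u)-333<19>]TJ"
    "/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ"
)

# Either a font selection ("/F8 9.9626 Tf") or a TJ array ("[...]TJ").
TOKEN = re.compile(r"/(\w+)\s+[\d.]+\s+Tf|\[(.*?)\]\s*TJ")
# Inside a TJ array: a hex string <..> or a simple literal string (..).
STRING = re.compile(r"<([0-9A-Fa-f]+)>|\(([^()\\]*)\)")


def shown_codes(stream):
    """Return (font, [character codes]) for each TJ operator, in stream order."""
    font = None
    result = []
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:
            font = m.group(1)  # remember the currently selected font
            continue
        codes = []
        for s in STRING.finditer(m.group(2)):
            if s.group(1) is not None:
                codes.extend(bytes.fromhex(s.group(1)))
            else:
                codes.extend(s.group(2).encode("latin-1"))
        result.append((font, codes))
    return result


for font, codes in shown_codes(STREAM):
    print(font, ["0x%02x" % c for c in codes])
```

In the excerpt, every 'a', 'o' and 'u' (0x61, 0x6f, 0x75) in font F8 is preceded by a separate TJ that shows code 0x7f, and code 0x0f only appears while F14 is selected.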
What you can see here is the frequent occurrence of...
... the hex character codes <7f>, <19> and <0f>,
... which are '\177', '\031' and '\017' in octal, and
... which are the control characters 'DEL', 'EM' and 'SI' in ASCII.
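The conversion between the three notations can be checked with a few lines of Python. This is just a sanity check of the arithmetic; the helper name describe and the small lookup table are made up for this example.

```python
# ASCII names of the three control characters that occur in the stream.
CONTROL_NAMES = {0x0F: "SI", 0x19: "EM", 0x7F: "DEL"}


def describe(hex_code):
    """Format a hex character code as hex, octal escape and ASCII name."""
    value = int(hex_code, 16)
    name = CONTROL_NAMES.get(value, "?")
    return "<%s> = \\%03o = %s" % (hex_code, value, name)


for code in ("7f", "19", "0f"):
    print(describe(code))
# <7f> = \177 = DEL
# <19> = \031 = EM
# <0f> = \017 = SI
```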
I suspect that one of these codes represents the 'ß' in the built-in font
encoding, another one the 'bullet', and the third a 'dieresis' used to
construct the umlauts. To investigate further, I used this command:
grep -a CharSet poppler-bug#66569.pdf
It gave this output:
/CharSet (/a/dieresis/germandbls/o/one/u)
/CharSet (/bullet)
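For systems without grep, roughly the same check can be sketched in Python. Like the grep call, it only works when the font descriptors are stored uncompressed in the file; the function name charsets and the sample bytes are made up for this example (the sample mirrors the output above).

```python
import re


def charsets(pdf_bytes):
    """Return the glyph names of every /CharSet entry found in raw PDF bytes."""
    found = []
    for m in re.finditer(rb"/CharSet\s*\(([^)]*)\)", pdf_bytes):
        names = m.group(1).decode("ascii", "replace").split("/")
        found.append([n for n in names if n])  # drop the empty leading field
    return found


# Stand-in for open("some.pdf", "rb").read()
sample = b"<< /CharSet (/a/dieresis/germandbls/o/one/u) >>\n<< /CharSet (/bullet) >>"
print(charsets(sample))
# [['a', 'dieresis', 'germandbls', 'o', 'one', 'u'], ['bullet']]
```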
This confirms my suspicion: the embedded font 'CMR10' is subsetted to include
only the glyphs for
* 'a'
* 'o'
* 'u'
* 'dieresis'
* 'germandbls' (German double s == ß)
* 'one' (the page number shown at the bottom of the page)
The other font, CMSY10, has only one glyph:
* 'bullet'
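To illustrate what a text extractor would have to do with these glyphs, here is a minimal Python sketch that turns a sequence of glyph names into Unicode text and attaches each 'dieresis' to the base letter drawn after it. This is not what Poppler, Preview or Acrobat actually do. The glyph-to-Unicode table and the function name glyphs_to_text are assumptions made for this example, and a real extractor would also have to check that the accent is positioned on top of the letter.

```python
import unicodedata

# Assumed mapping from the glyph names in the two CharSets to Unicode.
GLYPH_TO_UNICODE = {
    "a": "a",
    "o": "o",
    "u": "u",
    "one": "1",
    "germandbls": "\u00df",  # ß
    "bullet": "\u2022",      # •
    "dieresis": "\u0308",    # COMBINING DIAERESIS
}


def glyphs_to_text(glyph_names):
    """Join glyphs into text; an accent drawn BEFORE a letter is attached to it."""
    out = []
    pending = ""  # combining marks waiting for their base letter
    for name in glyph_names:
        ch = GLYPH_TO_UNICODE[name]
        if unicodedata.combining(ch):
            pending += ch
        else:
            # Unicode wants base + mark; NFC then yields the precomposed letter.
            out.append(unicodedata.normalize("NFC", ch + pending))
            pending = ""
    out.append(pending)  # a trailing accent without a base letter is kept as-is
    return "".join(out)


# First text line of the page, in content-stream order.
print(glyphs_to_text(
    ["dieresis", "a", "dieresis", "o", "dieresis", "u", "germandbls"]
))
# äöüß
```

Without the reordering step, the accent ends up before or after the wrong character, which matches the kind of output seen in the pasted results above.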
LaTeX is good for preparing print- and read-ready PDF files. It is bad for
creating PDFs that are meant to be accessible: people who need accessibility
features in their documents (e.g. to enable a screen reader) have the same
problems as people who want to copy'n'paste from the documents.
----
Poppler may have many problems with copy'n'pasting text from PDFs. This issue
is not one of them...