<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Wrong selection with umlauts"
href="https://bugs.freedesktop.org/show_bug.cgi?id=66569#c2">Comment # 2</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Wrong selection with umlauts"
href="https://bugs.freedesktop.org/show_bug.cgi?id=66569">bug 66569</a>
from <span class="vcard"><a class="email" href="mailto:kurt.pfeifle@gmail.com" title="kurt.pfeifle@gmail.com">kurt.pfeifle@gmail.com</a>
</span></b>
<pre>This PDF uses fonts with a non-standard encoding that is built into the fonts.
This makes it rather tricky to convert the PDF to text, to extract or
copy'n'paste from it, or to get a screen reader to read the document
aloud! Not even Apple or Adobe get it completely right:
Pasting from Preview.app on a Mac, I get this:
a ̈ o ̈ u ̈ ß • a ̈
• o ̈ • u ̈ •ß
1
Pasting from Adobe Acrobat Pro on a Mac, I get this:
a o u
a
o
u
1
Acrobat doesn't even display the bullet in front of the first item in the list
(in front of the 'ä')!
Using Poppler's pdftotext -layout achieves this:
a¨
¨ ou
¨ß
• ¨
a
• ¨
o
• u
¨
• ß
So even these tools have problems pasting text from LaTeX-originating PDFs
correctly!
The reason is this: very frequently, LaTeX builds composite characters from two
glyphs (a "digraph" of accent plus base letter) and does *NOT* use the real
umlaut glyphs (named 'adieresis', 'udieresis' and 'odieresis' in PDF parlance),
which non-LaTeX fonts provide.
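A minimal sketch of what a text extractor would have to do to turn such a
composite back into a single character, assuming the accent has already been
identified as a dieresis and paired with its base letter. The helper name
compose_umlaut is made up for this illustration; only Python's standard
unicodedata module is used.

```python
import unicodedata

# U+0308 is COMBINING DIAERESIS; in Unicode it follows the base letter.
COMBINING_DIAERESIS = "\u0308"

def compose_umlaut(base):
    """Hypothetical helper: merge a base letter and a dieresis into one character."""
    # NFC normalization replaces base + combining mark with the precomposed form.
    return unicodedata.normalize("NFC", base + COMBINING_DIAERESIS)

print([compose_umlaut(c) for c in "aou"])  # ['ä', 'ö', 'ü']
```

The hard part for a real extractor is the pairing step itself, since the PDF
only contains two separately positioned glyphs.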
To show the composite characters in this PDF, I uncompressed the page content stream and found this:
5 0 obj
<<
/Length 570
>>
stream
BT
/F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ 0 0.434 Td [(a)]TJ 8.302 -0.434 Td
[<7f>]TJ 0 0.434 Td [(o)]TJ 8.579 -0.434 Td [<7f>]TJ -0.277 0.434 Td
[(u)-333<19>]TJ/F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ/F8 9.9626 Tf 9.963
-0.434 Td [<7f>]TJ 0 0.434 Td [(a)]TJ/F14 9.9626 Tf -9.963 -19.925 Td
[<0f>]TJ/F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ 0 0.434 Td [(o)]TJ/F14 9.9626 Tf
-9.963 -19.925 Td [<0f>]TJ/F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ -0.277 0.434
Td [(u)]TJ/F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ/F8 9.9626 Tf 9.963 0 Td
[<19>]TJ 158.626 -486.177 Td [(1)]TJ
ET
endstream
endobj
What you can see here is the frequent occurrence of the...
... <7f>, <19> and <0f> hex character codes,
... which translate to '\177', '\031' and '\017' in octal, and
... to 'DEL', 'EM' and 'SI' in ASCII.
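The conversion above can be double-checked with a few lines of Python. This is
only a sketch: the ASCII names are a hand-written lookup table for these three
codes, not something Python provides, and the function name describe is made up.

```python
# ASCII control-character names for the three codes, written out by hand.
ASCII_NAMES = {0x7F: "DEL", 0x19: "EM", 0x0F: "SI"}

def describe(hex_code):
    """Return (octal escape, ASCII name) for a two-digit hex character code."""
    value = int(hex_code, 16)
    return "\\%03o" % value, ASCII_NAMES[value]

for code in ("7f", "19", "0f"):
    print(code, *describe(code))
```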
I suspect that, in the fonts' builtin encodings, <7f> is the 'dieresis' used to
construct the umlauts (it is always drawn right before an 'a', 'o' or 'u'),
<19> is the 'ß', and <0f> is the 'bullet' (it only occurs with font F14).
To investigate further, I used this command:
grep -a CharSet poppler-<a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Wrong selection with umlauts"
href="show_bug.cgi?id=66569">bug#66569</a>.pdf
It gave this output:
/CharSet (/a/dieresis/germandbls/o/one/u)
/CharSet (/bullet)
This confirms my suspicion: the embedded font 'CMR10' is subsetted to include
only the glyphs for
* 'a'
* 'o'
* 'u'
* 'dieresis'
* 'germandbls' (German double s == ß)
* 'one' (the page number shown at the bottom of the page)
The other font, CMSY10, has only one glyph:
* 'bullet'
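The same check can be scripted. The sketch below does roughly what the grep
does, assuming the /CharSet entries are stored uncompressed in the file (as
they evidently are here). The function name find_charsets and the sample bytes
are made up for illustration.

```python
import re

def find_charsets(pdf_bytes):
    """Return the glyph names of every /CharSet (...) entry found in raw PDF bytes."""
    charsets = []
    for match in re.finditer(rb"/CharSet\s*\(([^)]*)\)", pdf_bytes):
        # Glyph names are slash-separated, e.g. /a/dieresis/germandbls
        names = match.group(1).decode("ascii").split("/")
        charsets.append([n for n in names if n])
    return charsets

# Sample bytes mimicking the two font descriptors in this PDF.
sample = b"/CharSet (/a/dieresis/germandbls/o/one/u)\n/CharSet (/bullet)\n"
print(find_charsets(sample))
```

For a real file, the bytes would come from reading the PDF in binary mode.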
LaTeX is good for preparing print- and read-ready PDF files. It is bad for
creating PDFs which you want to make accessible: people who need accessibility
features in their documents (e.g. to enable a screen reader) have the same
problems as people who want to copy'n'paste from the documents.
----
Poppler may have many problems with copy'n'pasting text from PDFs. This issue
is not one of them...</pre>
</div>
</p>
</body>
</html>