<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <p> <div> <b><a class="bz_bug_link bz_status_NEW " title="NEW --- - Wrong selection with umlauts" href="https://bugs.freedesktop.org/show_bug.cgi?id=66569#c2">Comment # 2</a> on <a class="bz_bug_link bz_status_NEW " title="NEW --- - Wrong selection with umlauts" href="https://bugs.freedesktop.org/show_bug.cgi?id=66569">bug 66569</a> from <span class="vcard"><a class="email" href="mailto:kurt.pfeifle@gmail.com" title="kurt.pfeifle@gmail.com">kurt.pfeifle@gmail.com</a> </span></b> <pre>This PDF uses fonts with a non-standard encoding that is built into the fonts
themselves. This makes it rather tricky to convert the PDF to text, to extract
or copy'n'paste from it, or to get a screen reader to read the document aloud!

Not even Apple or Adobe get it completely right:

Pasting from Preview.app on a Mac, I get this:

  a ̈ o ̈ u ̈ ß • a ̈ • o ̈ • u ̈ •ß 1

Pasting from Adobe Acrobat Pro on a Mac, I get this:

  a o u a o u 1

Acrobat doesn't even display the bullet in front of the first item in the
list (in front of the 'ä')!

Poppler's 'pdftotext -layout' produces this:

  a¨ ¨ ou ¨ß • ¨ a • ¨ o • u ¨ • ß

So even these tools have trouble pasting correctly from LaTeX-generated PDFs!

The reason is this: very frequently, LaTeX uses "digraphs" to create composite
characters, that is, a base letter plus a separate 'dieresis' accent glyph,
*NOT* the real umlaut glyphs (named 'adieresis', 'udieresis' and 'odieresis'
in PDF parlance) which are provided by non-LaTeX fonts.

To show you this, I uncompressed the page content stream and see this:

  5 0 obj
  << /Length 570 >>
  stream
  BT
  /F8 9.9626 Tf 121.577 726.257 Td [<7f>]TJ
  0 0.434 Td [(a)]TJ
  8.302 -0.434 Td [<7f>]TJ
  0 0.434 Td [(o)]TJ
  8.579 -0.434 Td [<7f>]TJ
  -0.277 0.434 Td [(u)-333<19>]TJ
  /F14 9.9626 Tf -11.623 -21.918 Td [<0f>]TJ
  /F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ
  0 0.434 Td [(a)]TJ
  /F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ
  /F8 9.9626 Tf 9.963 -0.434 Td [<7f>]TJ
  0 0.434 Td [(o)]TJ
  /F14 9.9626 Tf -9.963 -19.925 Td [<0f>]TJ
  /F8 9.9626 Tf 10.24 -0.434 Td [<7f>]TJ
  -0.277 0.434 Td [(u)]TJ
  /F14 9.9626 Tf -9.963 -19.926 Td [<0f>]TJ
  /F8 9.9626 Tf 9.963 0 Td [<19>]TJ
  158.626 -486.177 Td [(1)]TJ
  ET
  endstream
  endobj

What you can see here is the frequent occurrence of

  ... the hex character codes <7f>, <19> and <0f>,
  ... which are '\177', '\031' and '\017' in octal,
  ... and which correspond to 'DEL', 'EM' and 'SI' in ASCII.

I suspect that one of these codes represents the 'ß' in the built-in font
encoding, another the 'bullet', and the third a 'dieresis' used to construct
the umlauts.

To investigate further, I used this command:

  grep -a CharSet poppler-<a class="bz_bug_link bz_status_NEW " title="NEW --- - Wrong selection with umlauts" href="show_bug.cgi?id=66569">bug#66569</a>.pdf

It gave this output:

  /CharSet (/a/dieresis/germandbls/o/one/u)
  /CharSet (/bullet)

This confirms my suspicion: the embedded font 'CMR10' is subsetted to include
only the glyphs for

  * 'a'
  * 'o'
  * 'u'
  * 'dieresis'
  * 'germandbls' (German double s == ß)
  * 'one' (the page number shown at the bottom of the page)

The other font, CMSY10, has only one glyph:

  * 'bullet'

LaTeX is good for preparing print- and read-ready PDF files. It is bad for
creating PDFs which you want to make accessible: people who need accessibility
features in their documents (e.g. to enable a screen reader) have the same
problems as people who want to copy'n'paste from the documents.
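
To illustrate the analysis above, here is a minimal Python sketch. It is only
an illustration, not what Poppler or any other tool does. The GLYPHS table is
an assumption: it is my reading of the content stream and the two /CharSet
entries (7f in F8 precedes every a/o/u, 0f only occurs in F14, 19 is what is
left), not something read from the font programs. The names describe() and
recompose() are made up for this example. It converts the codes to octal and
merges each accent with the base letter that follows it.

```python
import unicodedata

# ASSUMED mapping from (font, character code) to glyph name, inferred from
# the content stream and the /CharSet entries; not read from the fonts.
GLYPHS = {
    ("F8", 0x7F): "dieresis",
    ("F8", 0x19): "germandbls",
    ("F14", 0x0F): "bullet",
}

COMBINING_DIAERESIS = "\u0308"


def describe(code):
    """Return the hex and octal spelling of a one-byte character code."""
    return "{:02x}".format(code), "\\{:03o}".format(code)


def recompose(tokens):
    """Turn (font, code) pairs into text, merging dieresis + base letter.

    The stream shows the accent before its base letter, so a pending
    accent is attached to the next character and then NFC-normalised.
    """
    out = []
    pending_accent = False
    for font, code in tokens:
        name = GLYPHS.get((font, code))
        if name == "dieresis":
            pending_accent = True
            continue
        if name == "germandbls":
            char = "\u00df"
        elif name == "bullet":
            char = "\u2022"
        else:
            char = chr(code)  # plain ASCII codes such as 'a', 'o', 'u', '1'
        if pending_accent:
            char = unicodedata.normalize("NFC", char + COMBINING_DIAERESIS)
            pending_accent = False
        out.append(char)
    return "".join(out)


# First text line of the page: accent a, accent o, accent u, germandbls
first_line = [("F8", 0x7F), ("F8", ord("a")), ("F8", 0x7F), ("F8", ord("o")),
              ("F8", 0x7F), ("F8", ord("u")), ("F8", 0x19)]
print(recompose(first_line))
print([describe(c) for c in (0x7F, 0x19, 0x0F)])
```

A real extractor cannot rely on stream order alone. It would also have to
look at the glyph positions given by the Td operators to decide which base
letter an accent belongs to.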

----

Poppler may have many problems with copy'n'pasting text from PDFs. This issue
is not one of them...</pre> </div> </p> <hr> <span>You are receiving this mail because:</span> <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>