<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body><table border="1" cellspacing="0" cellpadding="8"> <tr> <th>Bug ID</th> <td><a class="bz_bug_link bz_status_NEW " title="NEW - rendering pdf and pdftotext give different results" href="https://bugs.freedesktop.org/show_bug.cgi?id=104085">104085</a> </td> </tr> <tr> <th>Summary</th> <td>rendering pdf and pdftotext give different results </td> </tr> <tr> <th>Product</th> <td>poppler </td> </tr> <tr> <th>Version</th> <td>unspecified </td> </tr> <tr> <th>Hardware</th> <td>x86-64 (AMD64) </td> </tr> <tr> <th>OS</th> <td>Linux (All) </td> </tr> <tr> <th>Status</th> <td>NEW </td> </tr> <tr> <th>Severity</th> <td>normal </td> </tr> <tr> <th>Priority</th> <td>medium </td> </tr> <tr> <th>Component</th> <td>utils </td> </tr> <tr> <th>Assignee</th> <td>poppler-bugs@lists.freedesktop.org </td> </tr> <tr> <th>Reporter</th> <td>galtgendo@o2.pl </td> </tr></table> <p> <div> <pre>...well, of course they do: one renders the PDF graphically, the other just tries to extract the text. However, the issue is this: I've stumbled upon a PDF file that is displayed correctly, but pdftotext was dumping strings that would look like typos, except that the "typo" is always the same character. So I've looked into the content:
699 0 obj
&lt;&lt;
/BaseEncoding /WinAnsiEncoding
/Differences [ 1 /zdot /aogonek /eogonek /sacute /cacute /Sacute /nacute /Zdot /zacute /Zacute ]
/Type /Encoding
&gt;&gt;
endobj
700 0 obj
&lt;&lt;
/Ascent 625
/CapHeight 625
/Descent -177
/Flags 4
/FontBBox [ 5 -177 638 877 ]
/FontFile2 712 0 R
/FontName /RDZRPI+TimesNewRoman
/ItalicAngle 0
/MissingWidth 777
/StemV 95
/Type /FontDescriptor
&gt;&gt;
endobj
701 0 obj
&lt;&lt;
/Length 702 0 R
&gt;&gt;
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R706 def
1 begincodespacerange
&lt;00&gt;&lt;ff&gt;
endcodespacerange
10 beginbfrange
&lt;01&gt;&lt;01&gt;&lt;015c&gt;
&lt;02&gt;&lt;02&gt;&lt;0105&gt;
&lt;03&gt;&lt;03&gt;&lt;0119&gt;
&lt;04&gt;&lt;04&gt;&lt;015b&gt;
&lt;05&gt;&lt;05&gt;&lt;0107&gt;
&lt;06&gt;&lt;06&gt;&lt;015a&gt;
&lt;07&gt;&lt;07&gt;&lt;0144&gt;
&lt;08&gt;&lt;08&gt;&lt;015b&gt;
&lt;09&gt;&lt;09&gt;&lt;017a&gt;
&lt;0a&gt;&lt;0a&gt;&lt;0179&gt;
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

Well, that's just one of several such sets. The point is that, for example, the entry for '/zdot' (code 01) is '015c' but should be '017c'; at least, changing it to that gives proper results in pdftotext, and the PDF file modified that way still displays correctly.

So, is there a step that pdftotext is skipping that it could perform to get the proper result, or is this something that only works during on-screen rendering?</pre> </div> </p> </body> </html>