<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - rendering pdf and pdftotext give different results"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=104085">104085</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>rendering pdf and pdftotext give different results
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>poppler
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>x86-64 (AMD64)
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux (All)
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>medium
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>utils
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>poppler-bugs@lists.freedesktop.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>galtgendo@o2.pl
          </td>
        </tr></table>
      <p>
        <div>
        <pre>...well, of course they do - one renders the pdf graphically, the other just
tries to extract the text...

However, the issue is this: I've stumbled upon a pdf file that's displayed
correctly, but pdftotext was dumping strings that would look like typos, if not
for the "typo" always being the same char.

So, I've looked into the content.

699 0 obj
<<
  /BaseEncoding /WinAnsiEncoding
  /Differences [
    1
    /zdot
    /aogonek
    /eogonek
    /sacute
    /cacute
    /Sacute
    /nacute
    /Zdot
    /zacute
    /Zacute
  ]
  /Type /Encoding
>>
endobj

700 0 obj
<<
  /Ascent 625
  /CapHeight 625
  /Descent -177
  /Flags 4
  /FontBBox [
    5
    -177
    638
    877
  ]
  /FontFile2 712 0 R
  /FontName /RDZRPI+TimesNewRoman
  /ItalicAngle 0
  /MissingWidth 777
  /StemV 95
  /Type /FontDescriptor
>>
endobj

701 0 obj
<<
  /Length 702 0 R
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R706 def
1 begincodespacerange
<00><ff>
endcodespacerange
10 beginbfrange
<01><01><015c>
<02><02><0105>
<03><03><0119>
<04><04><015b>
<05><05><0107>
<06><06><015a>
<07><07><0144>
<08><08><015b>
<09><09><017a>
<0a><0a><0179>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

Well, that's just one of a few such sets. The point is that - for example -
'/zdot' (code 01, mapped to '015c' above) should be '017c', or at least changing
it to that gives proper results in pdftotext. The pdf file modified that way
still displays correctly.
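
To make the mismatch easier to see, here is a minimal Python sketch that lines
up the /Differences names from object 699 with the bfrange targets from object
701. The glyph-name table is hand-copied and assumes the code points that the
Adobe Glyph List assigns to these names; find_mismatches is a made-up helper
for illustration, not anything from poppler.

```python
# Glyph names from the /Differences array of object 699, starting at code 1.
DIFFERENCES_START = 1
DIFFERENCES = ["zdot", "aogonek", "eogonek", "sacute", "cacute",
               "Sacute", "nacute", "Zdot", "zacute", "Zacute"]

# Targets of the bfrange entries in the CMap of object 701, copied as quoted.
CMAP_TARGETS = {
    0x01: 0x015C, 0x02: 0x0105, 0x03: 0x0119, 0x04: 0x015B, 0x05: 0x0107,
    0x06: 0x015A, 0x07: 0x0144, 0x08: 0x015B, 0x09: 0x017A, 0x0A: 0x0179,
}

# Assumption: code points as listed for these names in the Adobe Glyph List.
GLYPH_TO_UNICODE = {
    "zdot": 0x017C, "aogonek": 0x0105, "eogonek": 0x0119, "sacute": 0x015B,
    "cacute": 0x0107, "Sacute": 0x015A, "nacute": 0x0144, "Zdot": 0x017B,
    "zacute": 0x017A, "Zacute": 0x0179,
}


def find_mismatches(start, names, cmap_targets, glyph_table):
    """Return (code, name, expected, actual) for every entry that disagrees."""
    mismatches = []
    for offset, name in enumerate(names):
        code = start + offset
        expected = glyph_table[name]
        actual = cmap_targets.get(code)  # None if the CMap has no entry
        if actual != expected:
            mismatches.append((code, name, expected, actual))
    return mismatches


if __name__ == "__main__":
    found = find_mismatches(DIFFERENCES_START, DIFFERENCES,
                            CMAP_TARGETS, GLYPH_TO_UNICODE)
    for code, name, expected, actual in found:
        actual_text = "missing" if actual is None else f"{actual:04x}"
        print(f"code {code:02x} /{name}: glyph name says {expected:04x}, "
              f"CMap says {actual_text}")
```

For the data quoted here it flags codes 01 (/zdot) and 08 (/Zdot); the other
eight entries agree with their glyph names.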

So, is there a step that pdftotext is skipping that it could be doing to get
the proper result, or is it something that only works during on-screen
rendering?</pre>
        </div>
      </p>


    </body>
</html>