<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - pdftocairo -pdf output breaks extracted text"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=106444">106444</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>pdftocairo -pdf output breaks extracted text
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>poppler
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>x86-64 (AMD64)
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux (All)
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>medium
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>utils
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>poppler-bugs@lists.freedesktop.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>nopbin+freedeskbugs@gmail.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=139431" name="attach_139431" title="PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.">attachment 139431</a> <a href="attachment.cgi?id=139431&action=edit" title="PDFs, original and outputs from Ubuntu and Mac. Extracted text original and Ubuntu optimized.">[details]</a></span>
PDFs, original and outputs from Ubuntu and Mac. Extracted text original and
Ubuntu optimized.

Under Ubuntu 16.04, processing certain PDFs with pdftocairo -pdf (both versions
0.41.0 (pkg) and 0.64.0 (src)) causes text extracted from the resulting PDF to
appear as question mark symbols, which suggests a text encoding problem.  The
rendered image output appears correct.

I initially observed the problem with the extracted text while programmatically
processing the text layer rendered by pdf.js, and then confirmed the behavior
with the output of pdftotext. (The same happens when copying text from other
PDF viewers.)
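
A minimal sketch of how the check above could be automated, assuming pdftocairo
and pdftotext are on PATH. The function names and the choice to count '?' and
U+FFFD are illustrative assumptions, not part of poppler.

```python
import subprocess


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that are '?' or U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if c in "?\ufffd")
    return bad / len(chars)


def extract_text(pdf_path):
    """Run `pdftotext PDF -` and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def roundtrip_ratio(src_pdf, out_pdf):
    """Convert with `pdftocairo -pdf`, then score the extracted text."""
    subprocess.run(["pdftocairo", "-pdf", src_pdf, out_pdf], check=True)
    return suspicious_ratio(extract_text(out_pdf))
```

A ratio close to 1.0 for the converted file, next to a low ratio for the
original, would match the behavior described here. Text that legitimately
contains many question marks would need a different heuristic.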

Interestingly, when the same PDF is processed on a Mac with pdftocairo (0.64.0),
the output PDF's extracted text appears *correct*.  I am not sure whether it is
relevant, but in the attached example I do observe some differences in the font
encoding, as shown below.


pdffonts from original PDF:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    FFXDHY+ArialMT                       TrueType          MacRoman         yes yes no      10  0
    EESSLH+Helvetica                     TrueType          WinAnsi          yes yes yes      9  0



pdffonts after processing on Ubuntu:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    DFUWOB+ArialMT                       CID TrueType      Identity-H       yes yes yes      5  0


pdffonts after processing on Mac:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    DFUWOB+ArialMT                       TrueType          WinAnsi          yes yes yes      5  0</pre>
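        <pre>To compare the font tables above programmatically, the fixed-width pdffonts
output can be split on the column spans of its dashed separator line. This is a
rough sketch; parse_pdffonts is a hypothetical helper, and it assumes no font
name is long enough to push the later columns out of alignment.

```python
import re


def parse_pdffonts(output):
    """Parse the fixed-width table printed by pdffonts into dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # Locate the dashed separator; the header is the line just before it.
    sep_idx = None
    for i, ln in enumerate(lines):
        if set(ln.strip()) == {"-", " "}:
            sep_idx = i
            break
    if sep_idx is None or sep_idx == 0:
        return []
    # Each run of dashes marks the character span of one column.
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_idx])]
    header = lines[sep_idx - 1]
    names = [header[a:b].strip() for a, b in spans]
    rows = []
    for ln in lines[sep_idx + 1:]:
        rows.append({n: ln[a:b].strip() for n, (a, b) in zip(names, spans)})
    return rows
```

Slicing by column span rather than splitting on whitespace keeps values such as
"CID TrueType" in a single field, so the "encoding" entries of two runs can be
compared directly.</pre>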
        </div>
      </p>


    </body>
</html>