<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=102651">102651</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>pdftotext converts all non-breaking spaces U+A0 and U+202F into U+20
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>poppler
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>medium
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>utils
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>poppler-bugs@lists.freedesktop.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>daniel.flipo@free.fr
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=134154" name="attach_134154" title="PDF file with non-breaking spaces to be preserved">attachment 134154</a> <a href="attachment.cgi?id=134154&action=edit" title="PDF file with non-breaking spaces to be preserved">[details]</a></span>
PDF file with non-breaking spaces to be preserved

The fix for <a class="bz_bug_link 
          bz_status_RESOLVED  bz_closed"
   title="RESOLVED FIXED - No word splitting for pdfs produced by Chrome"
   href="show_bug.cgi?id=97399">bug #97399</a> added the non-breaking spaces U+A0 and U+202F to the
function UnicodeIsWhitespace, which holds the list of all spaces used to break
lines into words.

As a result, pdftotext converts these non-breaking spaces into breakable U+20
spaces. In some cases (ties as in "Mr Bean", spaces before high punctuation in
French, etc.) the non-breaking spaces are added intentionally and should be
preserved as such in the text or HTML output.

A pdftotext option that removes these two characters from
UnicodeIsWhitespace would solve the issue.

I attach a small PDF file containing these non-breaking spaces for testing.</pre>
        </div>
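        <div>
        <pre>To illustrate the request, here is a minimal Python sketch of the behaviour
described above. It is not poppler's code: the names is_word_separator,
extract_text and keep_nbsp are hypothetical, and the list of breakable spaces
is only an illustrative subset. It assumes that words are split on every
character in the whitespace list and then rejoined with U+20.

```python
# Hypothetical sketch; not poppler's actual code or option names.
NON_BREAKING_SPACES = frozenset({"\u00a0", "\u202f"})
# Illustrative subset of the breakable spaces a word splitter might recognise.
BREAKABLE_SPACES = frozenset({" ", "\t", "\n", "\r", "\u2002", "\u2003", "\u2009"})


def is_word_separator(ch, keep_nbsp=False):
    """Return True if ch should end the current word."""
    if ch in BREAKABLE_SPACES:
        return True
    # Behaviour described in the report: non-breaking spaces also split words,
    # unless the (hypothetical) keep_nbsp option is set.
    return ch in NON_BREAKING_SPACES and not keep_nbsp


def extract_text(chars, keep_nbsp=False):
    """Split chars into words and rejoin them with U+0020."""
    words, current = [], []
    for ch in chars:
        if is_word_separator(ch, keep_nbsp):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return " ".join(words)
```

With keep_nbsp left at False, the tie in "Mr Bean" comes out as a plain U+20;
with keep_nbsp=True, the U+A0 stays inside the word and reaches the output
unchanged.</pre>
        </div>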
      </p>


    </body>
</html>