<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><span class="vcard"><a class="email" href="mailto:jason@inspiresomeone.us" title="Jason Crain <jason@inspiresomeone.us>"> <span class="fn">Jason Crain</span></a>
</span> changed
          <a class="bz_bug_link 
          bz_status_NEEDINFO "
   title="NEEDINFO - pdftohtml: don't put control characters in output"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=101770">bug 101770</a>
          <br>
             <table border="1" cellspacing="0" cellpadding="8">
          <tr>
            <th>What</th>
            <th>Removed</th>
            <th>Added</th>
          </tr>

         <tr>
           <td style="text-align:right;">Summary</td>
           <td>Is it possible to fix special chars?
           </td>
           <td>pdftohtml: don't put control characters in output
           </td>
         </tr>

         <tr>
           <td style="text-align:right;">Component</td>
           <td>utils
           </td>
           <td>pdftohtml
           </td>
         </tr></table>
      <p>
        <div>
            <b><a class="bz_bug_link 
          bz_status_NEEDINFO "
   title="NEEDINFO - pdftohtml: don't put control characters in output"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=101770#c5">Comment # 5</a>
              on <a class="bz_bug_link 
          bz_status_NEEDINFO "
   title="NEEDINFO - pdftohtml: don't put control characters in output"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=101770">bug 101770</a>
              from <span class="vcard"><a class="email" href="mailto:jason@inspiresomeone.us" title="Jason Crain <jason@inspiresomeone.us>"> <span class="fn">Jason Crain</span></a>
</span></b>
        <pre>I'm not an expert on PHP but it looks like that is calling out to poppler's
pdftohtml and PHP seems to not like control characters in HTML.  I also found
this section in a W3C working draft:

<a href="https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#text-0">https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#text-0</a>
Text must not contain U+0000 characters. Text must not contain permanently
undefined Unicode characters (noncharacters). Text must not contain control
characters other than space characters.

So pdftohtml should probably not be putting control characters in its output.</pre>
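        <pre>To make the quoted rule concrete, here is a minimal sketch in Python (not
poppler's own C++) of the kind of filtering it implies.  The names
strip_control_chars and ALLOWED_CONTROLS are made up for illustration and are
not part of poppler or PHP.  It only drops control characters (Unicode
category Cc, which includes U+0000) other than the HTML space characters; it
does not handle noncharacters.

```python
import unicodedata

# Control characters that HTML counts as space characters: tab, LF, FF, CR.
# (Hypothetical name, chosen for this sketch only.)
ALLOWED_CONTROLS = frozenset("\t\n\f\r")


def strip_control_chars(text):
    """Drop characters in Unicode category Cc, except allowed whitespace."""
    return "".join(
        ch
        for ch in text
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )
```

Whether pdftohtml should drop such characters, replace them, or escape them is
a separate decision; this sketch simply drops them.</pre>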
        </div>
      </p>


    </body>
</html>