<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body><table border="1" cellspacing="0" cellpadding="8"> <tr> <th>Bug ID</th> <td><a class="bz_bug_link bz_status_NEW " title="NEW - Spurious whitespace added after an &quot;ActualText&quot; segment" href="https://bugs.freedesktop.org/show_bug.cgi?id=106312">106312</a> </td> </tr> <tr> <th>Summary</th> <td>Spurious whitespace added after an "ActualText" segment </td> </tr> <tr> <th>Product</th> <td>poppler </td> </tr> <tr> <th>Version</th> <td>unspecified </td> </tr> <tr> <th>Hardware</th> <td>All </td> </tr> <tr> <th>OS</th> <td>Linux (All) </td> </tr> <tr> <th>Status</th> <td>NEW </td> </tr> <tr> <th>Severity</th> <td>normal </td> </tr> <tr> <th>Priority</th> <td>medium </td> </tr> <tr> <th>Component</th> <td>general </td> </tr> <tr> <th>Assignee</th> <td>poppler-bugs@lists.freedesktop.org </td> </tr> <tr> <th>Reporter</th> <td>michaelnm.meyer@gmail.com </td> </tr></table> <p> <div> <pre>Created <span class=""><a href="attachment.cgi?id=139219" name="attach_139219" title="Sample PDF">attachment 139219</a></span>
Sample PDF

The attached PDF file contains the string "aṭa" twice, first in a regular font and then in an italic font. In both cases, the dot below the "t" is rendered with an IPA font, and the resulting glyph is overlaid with the corresponding code point (U+1E6D) as "ActualText".

Extracting the text with "pdftotext" (or copy-pasting it from a PDF viewer that uses Poppler) yields the string "aṭa aṭ a" instead of the expected "aṭa aṭa". Both Acrobat Reader and Google Chrome's built-in PDF viewer correctly produce the string "aṭa aṭa".
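To make the suspected mechanism concrete, here is a minimal sketch of a gap-based word-break test. It is not Poppler's code: the function name startsNewWord, its parameters, and the threshold values used in the note after it are assumptions chosen purely for illustration.

```cpp
// Hypothetical sketch, not Poppler's implementation: decides whether the
// horizontal gap `sp` between two consecutive glyphs ends the current word.
// Both thresholds are expressed as fractions of the font size (assumed names).
bool startsNewWord(double sp, double fontSize,
                   double minWordBreakSpace, double minDupBreakOverlap)
{
    // Gap wider than the word-break threshold: treat it as a space.
    bool gapTooWide = sp > minWordBreakSpace * fontSize;
    // Glyph drawn far back over the previous one: treat it as a new word.
    bool overlapTooLarge = -sp > minDupBreakOverlap * fontSize;
    return gapTooWide || overlapTooLarge;
}
```

With a 12pt font and a word-break threshold of 0.1, a gap of 1.3pt exceeds the 1.2pt limit and splits the word, whereas a threshold of 0.15 (limit 1.8pt) keeps it together. These numbers are made up to show the effect; they are not measured from the attached PDF.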
Looking at Poppler's code, the culprit appears to be the following check in "poppler/TextOutputDev.cc":

    if (overlap || lastCharOverlap ||
        sp < -minDupBreakOverlap * curWord->fontSize ||
        sp > minWordBreakSpace * curWord->fontSize || // PROBLEM HERE
        fabs(base - curWord->base) > 0.5 ||
        curFontSize != curWord->fontSize ||
        wMode != curWord->wMode) {
        endWord();
    }

Slightly increasing the value of "minWordBreakSpace" produces the expected result. This makes me think that "curWord->fontSize" is not computed properly for the italic font.

The attached PDF file was produced with the following LaTeX code (to be compiled with lualatex):

\documentclass[12pt]{article}

\usepackage{newunicodechar}
\usepackage[luatex]{accsupp}
\usepackage{tipa}

\newunicodechar{ṭ}{%
  \BeginAccSupp{%
    method=hex,%
    unicode=true,%
    ActualText=1e6d,%
  }%
  \textsubdot{t}%
  \EndAccSupp{}%
}

\begin{document}
\thispagestyle{empty}
aṭa \textit{aṭa}
\end{document}</pre> </div> </p> </body> </html>