[Poppler-bugs] [Bug 106312] New: Spurious whitespace added after an "ActualText" segment

bugzilla-daemon at freedesktop.org
Sun Apr 29 16:37:39 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=106312

            Bug ID: 106312
           Summary: Spurious whitespace added after an "ActualText"
                    segment
           Product: poppler
           Version: unspecified
          Hardware: All
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: michaelnm.meyer at gmail.com

Created attachment 139219
  --> https://bugs.freedesktop.org/attachment.cgi?id=139219&action=edit
Sample PDF

The attached PDF file contains the string "aṭa" twice, first in a regular
font and then in an italic font. In both cases, the dot below "t" is
rendered with an IPA font, and the resulting character is overlaid with the
corresponding code point (U+1E6D) as "ActualText".

Extracting the text with "pdftotext" (or copying it from a PDF viewer that
uses Poppler) yields the string "aṭa aṭ a" instead of the expected "aṭa aṭa":
a spurious space appears after the "ActualText" segment in the italic word.
Both Acrobat Reader and Google Chrome's built-in PDF viewer correctly produce
the string "aṭa aṭa".

Judging from Poppler's code, the culprit appears to be the following check in
"poppler/TextOutputDev.cc":

    if (overlap || lastCharOverlap ||
        sp < -minDupBreakOverlap * curWord->fontSize ||
        sp > minWordBreakSpace * curWord->fontSize || // PROBLEM HERE
        fabs(base - curWord->base) > 0.5 ||
        curFontSize != curWord->fontSize ||
        wMode != curWord->wMode
        ) {
      endWord();
    }

Slightly increasing the value of "minWordBreakSpace" produces the expected
result. This makes me suspect that "curWord->fontSize" is not computed
properly for the italic font.
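
To illustrate why a small change to the threshold is enough, here is a
minimal standalone sketch of the two gap tests from the check above. It is
not Poppler's code: the function name, the constant names and the constant
values (0.2 and 0.1) are assumptions made for illustration, and the gap
values used below are made up rather than measured from the sample PDF.

```cpp
// Hypothetical stand-ins for minDupBreakOverlap and minWordBreakSpace.
// The values are assumptions for this sketch, not taken from the report.
constexpr double kMinDupBreakOverlap = 0.2;
constexpr double kMinWordBreakSpace = 0.1;

// Returns true when the horizontal gap `sp` between the previous glyph and
// the next one would end the current word, considering only the two
// gap-based conditions of the check (overlap, baseline, font size and
// writing mode are ignored here).
bool gapBreaksWord(double sp, double fontSize,
                   double minWordBreakSpace = kMinWordBreakSpace) {
    return sp < -kMinDupBreakOverlap * fontSize ||
           sp > minWordBreakSpace * fontSize;
}
```

With a font size of 12, a made-up gap of 1.3 exceeds 0.1 * 12 = 1.2 and
would end the word, whereas raising the threshold to 0.12 (limit 1.44) keeps
the glyphs in one word. A gap that sits just above the limit is therefore
enough to produce the stray space, whether the gap or the font size is the
value that is off.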

The attached PDF file was produced with the following LaTeX code (to be
compiled with lualatex):

   \documentclass[12pt]{article}

   \usepackage{newunicodechar}
   \usepackage[luatex]{accsupp}
   \usepackage{tipa}

   \newunicodechar{ṭ}{%
      \BeginAccSupp{%
         method=hex,%
         unicode=true,%
         ActualText=1e6d,%
      }%
      \textsubdot{t}%
      \EndAccSupp{}%
   }

   \begin{document}
   \thispagestyle{empty}
   aṭa \textit{aṭa}
   \end{document}
