[Poppler-bugs] [Bug 28052] New: pdftohtml loses some double lls in duplicate check

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon May 10 08:29:38 PDT 2010


https://bugs.freedesktop.org/show_bug.cgi?id=28052

           Summary: pdftohtml loses some double lls in duplicate check
           Product: poppler
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: chris at codealchemy.org


Problem: In some PDF documents two adjacent l characters overlap slightly, and
pdftohtml drops the second l.  E.g., "called" becomes "cal ed", "all" becomes
"al ", and "eventually" becomes "eventual y".

Version: poppler-0.13.3

Reason: In HtmlOutputDev.cc, class HtmlPage, method coalesce, there's a section
of code to discard duplicate text for "fake boldface, drop shadows."  The lls
are triggering the duplicate code and are thus removed from the output.

The debug output shows:
x=139.68000..143.016000  y=626.076000..641.844000  size=15 'l'
x=142.80000..146.136000  y=626.076000..641.844000  size=15 'l'

Due to my inexperience with the project I can't say what the best solution will
be.  Here are a few options I've considered.  If you'd like to suggest a
preferred method for solving this problem, I will implement it and submit a
patch.  Note, however, that I have no test documents that contain actual
duplicate text.

Solution 1: Decrease the fudge factor from 0.2 to 0.1.  This may not be
reliable and could cause the duplicates which this code was originally meant to
discard to resurface.  It will, however, let the lls through in my test
documents.
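
To make the effect of the fudge factor concrete, here is a minimal sketch of
such a duplicate test.  It is not the code from HtmlOutputDev.cc: the Box
struct and looksLikeDuplicate are hypothetical names, and the tolerance is
assumed to be the fudge factor times the bounding-box height (15.768 for the
boxes in the debug output), which is the reading that matches the behaviour
described above.

```cpp
#include <cmath>
#include <string>

// Hypothetical stand-in for one text fragment; not poppler's own class.
struct Box {
    double xMin, xMax, yMin, yMax;
    std::string text;
};

// Assumed duplicate test: identical text, and each edge of cur lies within
// fudge * height of the corresponding edge of prev.
bool looksLikeDuplicate(const Box &prev, const Box &cur, double fudge) {
    const double tol = (prev.yMax - prev.yMin) * fudge;
    return cur.text == prev.text &&
           std::fabs(cur.xMin - prev.xMin) < tol &&
           std::fabs(cur.xMax - prev.xMax) < tol &&
           std::fabs(cur.yMin - prev.yMin) < tol &&
           std::fabs(cur.yMax - prev.yMax) < tol;
}
```

Under these assumptions the two boxes from the debug output are 3.12 apart in
x.  With a fudge factor of 0.2 the tolerance is 3.1536, so the second l counts
as a duplicate; with 0.1 the tolerance is 1.5768 and the second l is kept.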

Solution 2: Make the duplicate check a command-line option.  Documents that
have both lls and duplicate text will still exhibit errors, though.

Solution 3: Use a different algorithm for determining duplicate text.  Perhaps
the dupe check shouldn't drop a character that starts past the horizontal
midpoint of the previous character's bounding box.  In this example, 141.348 is
the midpoint of the first character (halfway between 139.68 and 143.016), and
the second character starts at 142.8, which is beyond that.  It seems unlikely
for boldface or drop shadows to be offset so far from the starting point of
their host character.
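
A rough sketch of the midpoint test from Solution 3 follows.  The Glyph struct
and startsPastMidpoint are hypothetical names used only for illustration; in a
real patch the test would have to be combined with the existing duplicate
check in coalesce.

```cpp
#include <string>

// Hypothetical stand-in for a character's horizontal extent.
struct Glyph {
    double xMin, xMax;
    std::string text;
};

// Returns true when cur starts to the right of the horizontal midpoint of
// prev.  Under Solution 3 such a character would never be dropped as a
// duplicate, whatever the fudge-factor comparison says.
bool startsPastMidpoint(const Glyph &prev, const Glyph &cur) {
    const double midpoint = 0.5 * (prev.xMin + prev.xMax);
    return cur.xMin > midpoint;
}
```

For the two characters in the debug output the midpoint is 141.348 and the
second l starts at 142.8, so the function returns true and the l would be kept.
A copy shifted by only a fraction of a unit, as one might expect for fake
boldface, would return false and remain a candidate for removal.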
