[Poppler-bugs] [Bug 28052] New: pdftohtml loses some double lls in duplicate check
bugzilla-daemon at freedesktop.org
Mon May 10 08:29:38 PDT 2010
https://bugs.freedesktop.org/show_bug.cgi?id=28052
Summary: pdftohtml loses some double lls in duplicate check
Product: poppler
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: medium
Component: general
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: chris at codealchemy.org
Problem: In some PDF documents the two ls of a double l overlap slightly.
pdftohtml then drops the second l. E.g., called becomes cal ed, all becomes
al , and eventually becomes eventual y.
Version: poppler-0.13.3
Reason: In HtmlOutputDev.cc, class HtmlPage, method coalesce, there's a section
of code to discard duplicate text for "fake boldface, drop shadows." The second
l of each pair triggers this duplicate check and is thus removed from the output.
The debug output shows:
x=139.68000..143.016000 y=626.076000..641.844000 size=15 'l'
x=142.80000..146.136000 y=626.076000..641.844000 size=15 'l'
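To make the numbers concrete, here is a minimal sketch of how such a check could flag the two boxes above. It assumes the check treats two strings as duplicates when they carry the same text and each box edge differs by less than the fudge factor times the glyph height. The names TextBox and looksLikeDuplicate are hypothetical, and this is a reconstruction for illustration, not a copy of the code in HtmlOutputDev.cc.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical stand-in for a text string's bounding box;
// this is not poppler's own string class.
struct TextBox {
    double xMin, xMax, yMin, yMax;
    std::string text;
};

// Assumed duplicate test: same text, and every box edge lies within
// fudge * glyph height of the corresponding edge of the other box.
bool looksLikeDuplicate(const TextBox& a, const TextBox& b, double fudge) {
    assert(fudge > 0.0);
    const double tol = (a.yMax - a.yMin) * fudge;
    return a.text == b.text &&
           std::fabs(b.xMin - a.xMin) < tol &&
           std::fabs(b.xMax - a.xMax) < tol &&
           std::fabs(b.yMin - a.yMin) < tol &&
           std::fabs(b.yMax - a.yMax) < tol;
}

// The two boxes from the debug output above.
const TextBox firstL  {139.680, 143.016, 626.076, 641.844, "l"};
const TextBox secondL {142.800, 146.136, 626.076, 641.844, "l"};
```

Under these assumptions the glyph height is 15.768, so a fudge factor of 0.2 gives a tolerance of about 3.15. The two boxes are offset horizontally by 3.12, which is inside that tolerance, so looksLikeDuplicate(firstL, secondL, 0.2) reports a duplicate.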
Due to my inexperience with the project I can't say what the best solution will
be. Here are a few options I've considered. If you'd like to suggest a
preferred method for solving this problem, I will implement it and submit a
patch. Note, however, that I have no test documents that contain actual
duplicate text.
Solution 1: Decrease the fudge factor from 0.2 to 0.1. This may not be
reliable and could cause the duplicates which this code was originally meant to
discard to resurface. It will, however, let the lls through in my test
documents.
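As a rough way to judge how much margin a smaller fudge factor would leave, the horizontal offset between the two glyphs can be expressed as a fraction of the glyph height. This sketch assumes the tolerance is the fudge factor multiplied by the glyph height; offsetFraction is a hypothetical helper, not a poppler function.

```cpp
#include <cassert>

// Horizontal offset between two glyph starts, as a fraction of glyph height.
// Under the assumed check, a pair can only be treated as a duplicate when
// this fraction is below the fudge factor.
double offsetFraction(double xMinFirst, double xMinSecond,
                      double yMin, double yMax) {
    assert(yMax > yMin);
    return (xMinSecond - xMinFirst) / (yMax - yMin);
}
```

For the reported pair, offsetFraction(139.68, 142.8, 626.076, 641.844) is 3.12 / 15.768, or roughly 0.198. That is just under 0.2 and well above 0.1, which matches the lls passing through with the smaller factor. It says nothing about how close genuine fake-bold duplicates sit to 0.1.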
Solution 2: Make the duplicate check a command-line option. Documents that
have both lls and duplicate text will still exhibit errors, though.
Solution 3: Use a different algorithm for determining duplicate text. Perhaps
the dupe check shouldn't drop characters that start more than halfway across
the bounding box of the previous character. In this example, 141.348 is the
halfway point of the first character, and the second character starts at
142.8, which is beyond that. It seems unlikely for boldface or drop shadows to
start so far beyond the starting point of their host character.
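The halfway rule from Solution 3 could be sketched as follows. Span, midpoint, and startsPastMidpoint are hypothetical names used only for illustration; how such a test would be wired into the existing duplicate check in coalesce is left open.

```cpp
#include <cassert>

// Hypothetical horizontal extent of a character's bounding box.
struct Span {
    double xMin, xMax;
};

// Horizontal midpoint of a character's bounding box.
double midpoint(const Span& s) {
    return (s.xMin + s.xMax) / 2.0;
}

// True when the next character starts beyond the midpoint of the previous
// one. Under Solution 3 such a character would be kept, not dropped.
bool startsPastMidpoint(const Span& prev, const Span& next) {
    assert(prev.xMax >= prev.xMin);
    return next.xMin > midpoint(prev);
}
```

With the boxes from the debug output, midpoint({139.68, 143.016}) is 141.348 and the second l starts at 142.8, so startsPastMidpoint returns true and the character would survive. A shadow copy shifted by only a fraction of a point would start before the midpoint and could still be discarded.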