[Poppler-bugs] [Bug 47022] New: pdftohtml: control over word breaks
bugzilla-daemon at freedesktop.org
Tue Mar 6 06:45:36 PST 2012
https://bugs.freedesktop.org/show_bug.cgi?id=47022
Bug #: 47022
Summary: pdftohtml: control over word breaks
Classification: Unclassified
Product: poppler
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: minor
Priority: medium
Component: pdftohtml
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: thephilips at gmail.com
At the moment, poppler's pdftohtml uses the following formula, inherited from
Xpdf, to identify a word break (in HtmlOutputDev.cc, search for "0.1"):
: fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin)
I recently had to convert a PDF (produced by WinWord, evidently) in which the
kerning or something similar went badly wrong, and pdftohtml rendered a lot of
words (about 2.9K of them) split in two. E.g. (German text) "auf" became
"au f", "rechte" became "recht e", and so on.
Please provide a command line option to control the behavior of the word
breaking.
As far as I understand, not much can be done here except allowing the factor
used (0.1) to be adjusted. If I have understood its meaning correctly, a word
break is inserted when the distance between characters is more than 10% of the
character height. In my case, raising it to e.g. 0.15 or 0.2 would have let me
work around the bad kerning etc. and reduced the amount of editing needed
afterwards.
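
To illustrate the request, here is a minimal standalone sketch of the same test
with the factor turned into a parameter. It is not poppler's actual code:
StringBox, isWordBreak and the factor argument are hypothetical names made up
for this example, and only the comparison itself is taken from the formula
quoted above.

```cpp
#include <cmath>

// Hypothetical stand-in for the values the quoted formula reads from curStr;
// this is NOT poppler's real string class.
struct StringBox {
    double xRightOfLastChar; // corresponds to curStr->xRight[n-1]
    double yMin;             // corresponds to curStr->yMin
    double yMax;             // corresponds to curStr->yMax
};

// Returns true when the horizontal gap between the previous character and the
// next one (starting at x1) exceeds `factor` times the string height.
// factor = 0.1 reproduces the currently hard-coded behaviour.
bool isWordBreak(const StringBox &s, double x1, double factor = 0.1) {
    return std::fabs(x1 - s.xRightOfLastChar) > factor * (s.yMax - s.yMin);
}
```

For a string 10 units high, a gap of 1.2 units counts as a word break with the
default factor of 0.1 (threshold 1.0) but not with 0.15 (threshold 1.5), which
is the kind of adjustment that would have helped in my case.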
--
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.