[Poppler-bugs] [Bug 47022] New: pdftohtml: control over word breaks

Tue Mar 6 06:45:36 PST 2012

https://bugs.freedesktop.org/show_bug.cgi?id=47022

             Bug #: 47022
           Summary: pdftohtml: control over word breaks
    Classification: Unclassified
           Product: poppler
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: medium
         Component: pdftohtml
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: thephilips at gmail.com

At the moment poppler's pdftohtml, as inherited from Xpdf, uses the
following formula to detect a word break (in HtmlOutputDev.cc, search
for "0.1"):

: fabs(x1 - curStr->xRight[n-1]) > 0.1 * (curStr->yMax - curStr->yMin)

I recently had to convert a PDF (produced by WinWord, evidently) where the
kerning or something similar went badly wrong, and pdftohtml rendered many
words (about 2.9K of them) split in two. E.g. (German text) "auf" became
"au f", "rechte" became "recht e", and so on.

Please provide a command line option to control the word-breaking
behavior.

As far as I understand, not much can be done except to allow adjusting the
factor used, 0.1. If I have understood its meaning correctly: a word break is
inserted if the distance between characters is more than 10% of the
character's height. In my case, setting it higher, e.g. to 0.15 or 0.2, could
have let me work around the bad kerning and reduce the amount of editing
needed afterwards.
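
To illustrate the request, here is a minimal sketch of the check quoted above
with the factor turned into a parameter. It is not the actual poppler code:
the GlyphRun struct, the isWordBreak function and the wordBreakThreshold
parameter are hypothetical names used only for illustration, and no such
command line option exists yet.

```cpp
#include <cmath>

// Hypothetical stand-in for the fields of curStr used by the check.
struct GlyphRun {
    double xRight; // right edge of the previous character
    double yMin;   // bottom of the string's bounding box
    double yMax;   // top of the string's bounding box
};

// Returns true if the gap between the previous character and the next one
// (starting at x1) exceeds wordBreakThreshold times the string height.
// wordBreakThreshold is a hypothetical parameter; the default of 0.1 is the
// factor currently hard-coded in the formula quoted above.
bool isWordBreak(double x1, const GlyphRun &prev,
                 double wordBreakThreshold = 0.1) {
    return std::fabs(x1 - prev.xRight) >
           wordBreakThreshold * (prev.yMax - prev.yMin);
}
```

For example, with a string height of 10 and a gap of 1.2 between characters,
the default factor of 0.1 reports a word break, while a factor of 0.15 does
not.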
