[poppler] [patch] control of word breaks for PDF, request for comments

Josh Richardson jric at chegg.com
Mon Mar 12 15:56:35 PDT 2012

Given the simplicity of the fix, I think this is a great addition.

FWIW, there are other things we can do to correctly identify spaces
without trial and error -- for instance, the threshold should not be
fixed, but should depend upon the current character spacing, word-spacing,
and width of the "space" character (if present in the selected font).  My
colleagues and I have had some success automating using some of these
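
To make the idea of a spacing-dependent threshold concrete, here is a minimal
sketch. It is not poppler code and not the heuristic my colleagues and I
used. The function name, its parameters, and the 0.5 "fraction of a space"
factor are all assumptions chosen for illustration. It assumes the gap is
measured from the right edge of one glyph box to the left edge of the next,
so ordinary letters sit roughly one character-spacing apart.

```python
def word_break_threshold(font_size, char_spacing=0.0, word_spacing=0.0,
                         space_width=None, fallback_coeff=0.1,
                         space_fraction=0.5):
    """Return the smallest inter-glyph gap to treat as a word break.

    All names and constants here are hypothetical, for illustration only.
    """
    if space_width is not None and space_width > 0:
        # The font has a usable space glyph: require a gap of at least a
        # fraction of the space advance (plus any positive word spacing).
        slack = space_fraction * (space_width + max(word_spacing, 0.0))
    else:
        # No space glyph available: fall back to the fixed coefficient
        # scaled by the font size.
        slack = fallback_coeff * font_size
    # Letters inside a word are already char_spacing apart, so the
    # threshold starts from that baseline and never drops below zero.
    return max(char_spacing + slack, 0.0)


def is_word_break(gap, **params):
    """True if the gap is wide enough to count as a word break."""
    return gap > word_break_threshold(**params)
```

With a 10-unit font, 1.5 units of character spacing and a 2.5-unit space
glyph, a 1.6-unit gap stays inside the word here, whereas a fixed threshold
of 0.1 times the font size would split it.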


On 3/12/12 3:45 PM, "Ihar `Philips` Filipau" <thephilips at gmail.com> wrote:

>Hi All!
>Some time ago I encountered a document (can be provided privately
>via e-mail on request) which had a strange problem: spacing between
>some characters in words was uneven. Some sort of broken kerning or
>some such.
>When I tried to convert the PDF into HTML/XML, I noticed that the
>extra distance between characters was causing pretty much all PDF
>conversion and reading tools to recognize each affected word not as
>one word but as two or more words.
>I spent several weeks trying to salvage the document, and in the end
>I managed it. That work led me to hack on poppler to see if I could
>make the task easier. The result is the pretty
>simple patch for pdftohtml attached to the bug:
>It introduces a command line option to adjust the normally hard-coded
>coefficient 0.1 used to detect when a word break should occur.
>On one side, the patch somehow doesn't fit the whole picture:
>apparently pretty much all tools use the 0.1 coefficient for breaking
>up words.
>On the other side, IMO it would be good to have at least one tool
>capable of salvaging such documents. The alternative is lengthy and
>tedious menial proof-reading and editing.
>Does the community have any opinion on the topic in general or the
>patch in particular?
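
For readers following along, the fixed-coefficient rule described in the
quoted message can be sketched as below. This is an illustration, not the
pdftohtml source and not the attached patch. It assumes a word break is
declared when the gap between neighbouring glyph boxes exceeds the
coefficient times the font size. The function name, the glyph tuple layout,
and the sample coordinates are hypothetical.

```python
def split_words(glyphs, font_size, coeff=0.1):
    """Group glyphs on one line into words.

    glyphs: list of (char, x_min, x_max) tuples in reading order.
    coeff:  word-break coefficient; 0.1 is the fixed value discussed in
            the thread, and making it adjustable is what the patch is for.
    """
    words = []
    current = ""
    prev_x_max = None
    for ch, x_min, x_max in glyphs:
        # Start a new word when the gap (or a backward jump) is larger
        # than coeff * font_size.
        if prev_x_max is not None and abs(x_min - prev_x_max) > coeff * font_size:
            words.append(current)
            current = ""
        current += ch
        prev_x_max = x_max
    if current:
        words.append(current)
    return words


# Made-up glyph boxes with an uneven 1.5-unit gap between "o" and "r".
glyphs = [("w", 0.0, 6.0), ("o", 6.0, 12.0), ("r", 13.5, 18.0), ("d", 18.0, 24.0)]
print(split_words(glyphs, font_size=10.0))             # default coefficient 0.1
print(split_words(glyphs, font_size=10.0, coeff=0.2))  # relaxed coefficient
```

On this sample the default coefficient splits the word in two, while the
relaxed coefficient keeps it whole, which is the kind of salvage the
proposed option is meant to allow.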
