[poppler] [patch] control of word breaks for PDF, request for comments

Mon Mar 12 15:45:52 PDT 2012

Hi All!

Some time ago I encountered a document (can be provided privately
via e-mail on request) which had a strange problem: the spacing between
some characters within words was uneven, probably due to broken kerning
or something similar.

When I tried to convert the PDF into HTML/XML, I noticed that the
extra distance between characters caused pretty much all PDF
conversion and reading tools to treat a single word as two or more
words.

I spent several weeks salvaging the document and eventually
succeeded. That work led me to hack on poppler to see whether I could
make the task easier. The result is a fairly simple patch for
pdftohtml, attached to the bug:

https://bugs.freedesktop.org/show_bug.cgi?id=47022

It introduces a command line option to adjust the normally hard-coded
coefficient 0.1 used to detect when a word break should occur.
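
To illustrate what the coefficient controls, here is a minimal sketch
in Python. It is not the poppler code. It assumes that a word break is
declared when the horizontal gap between two adjacent characters
exceeds the coefficient times the font size; that rule, the function
name split_words, the glyph tuples and the numbers are all made up for
illustration.

```python
from typing import List, Optional, Tuple

# (character, x_min, x_max): hypothetical glyph boxes, same units as font_size
Glyph = Tuple[str, float, float]


def split_words(glyphs: List[Glyph], font_size: float,
                coefficient: float = 0.1) -> List[str]:
    """Group glyphs into words.

    A new word starts when the gap to the previous glyph is larger than
    coefficient * font_size (assumed rule, see the note above).
    """
    words: List[str] = []
    current = ""
    prev_x_max: Optional[float] = None
    for char, x_min, x_max in glyphs:
        if prev_x_max is not None and x_min - prev_x_max > coefficient * font_size:
            words.append(current)
            current = ""
        current += char
        prev_x_max = x_max
    if current:
        words.append(current)
    return words


# "word" at font size 10 with an oversized gap (1.5 units) between "o" and "r"
glyphs = [("w", 0.0, 7.0), ("o", 7.2, 12.2), ("r", 13.7, 17.0), ("d", 17.2, 22.0)]

print(split_words(glyphs, font_size=10.0))                   # default coefficient 0.1
print(split_words(glyphs, font_size=10.0, coefficient=0.2))  # relaxed coefficient
```

With the default coefficient the 1.5 unit gap is above the threshold
(1.0), so the word is split in two; with 0.2 the threshold becomes 2.0
and the word stays whole. That is the kind of adjustment the option is
meant to allow.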

On one side, the patch doesn't quite fit the whole picture:
apparently pretty much all tools use the 0.1 coefficient for breaking
up words.

On the other side, IMO it would be good to have at least one tool
capable of salvaging such documents. The alternative is lengthy and
tedious manual proofreading and editing.

Does the community have an opinion on the topic in general or on the
patch in particular?

Thanks.