[poppler] [patch] control of word breaks for PDF, request for comments

Ihar `Philips` Filipau thephilips at gmail.com
Mon Mar 12 18:29:54 PDT 2012


Thanks for the comment.

I have probably said it somewhere already, but IMO somebody
knowledgeable in poppler/Xpdf/PDF should first try to create an
infrastructure for reusing the word break mechanism, and the text
coalescing as well (which, if the calibre trouble tickets are any
indication, is the more sensitive area (*)). Both are currently
reimplemented - every time slightly differently - in literally every
OutputDev. That is probably not the best use of anyone's time, unless
of course it helps in the end to figure out the requirements for such
a word breaking/character coalescing utility.
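
To make that a bit more concrete, here is a minimal sketch of the kind
of reusable helper I have in mind. All names are hypothetical, nothing
below is existing poppler API; it is only meant to show the shape of
the thing:

// A sketch of a reusable word-break helper. Nothing here exists in
// poppler today; the names are made up.

struct CharBox {
  double xMin, yMin, xMax, yMax;   // glyph bounding box in device space
};

class WordBreaker {
public:
  // breakCoeff plays the role of the coefficient that is currently
  // hard-coded as 0.1 in the output devices.
  explicit WordBreaker(double breakCoeff = 0.1)
    : coeff(breakCoeff), havePrev(false), prev() {}

  // Returns true if a word break should be emitted before this glyph.
  bool isWordBreak(const CharBox &c) {
    bool brk = false;
    if (havePrev) {
      double gap    = c.xMin - prev.xMax;      // gap to the previous glyph
      double height = prev.yMax - prev.yMin;   // glyph height as size proxy
      brk = gap > coeff * height;
    }
    prev = c;
    havePrev = true;
    return brk;
  }

  void endLine() { havePrev = false; }         // reset at line boundaries

private:
  double coeff;
  bool havePrev;
  CharBox prev;
};

An OutputDev would then only feed it per-glyph geometry and ask whether
a break occurred, and the coefficient would live in one place where it
could be exposed as an option (which is essentially what the patch
discussed below does, for pdftohtml alone).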

Such a change is unfortunately beyond my knowledge of the matter: I'm
not well versed in the PDF structure, and most of GfxState is still a
deep mystery to me.

(*) One of the /popular/ problems is (or was? it is hard to track the
progress) that at times pdftohtml, during (I presume) coalescing,
decides that in "ll" (two lower-case Ls) the second 'l' is in fact a
shadow/duplicate of the first, and "merges" them by dropping the
second. E.g. after conversion people end up with text containing "tel"
instead of "tell".

On 3/12/12, Josh Richardson <jric at chegg.com> wrote:
> Given the simplicity of the fix, I think this is a great addition.
>
> FWIW, there are other things we can do to correctly identify spaces
> without trial and error -- for instance, the threshold should not be
> fixed, but should depend upon the current character spacing, word
> spacing, and width of the "space" character (if present in the
> selected font). My colleagues and I have had some success automating
> this using some of these metrics.
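
For illustration, an adaptive threshold along those lines could look
roughly like this. The names and the exact weighting are my guesses,
not anything poppler does today; Tc/Tw are the PDF character and word
spacing parameters:

#include <algorithm>

struct TextStateInfo {
  double fontSize;         // current font size
  double charSpacing;      // Tc
  double wordSpacing;      // Tw
  double spaceGlyphWidth;  // advance of ' ' in the current font, 0 if absent
};

// Smallest horizontal gap that should be treated as a word break.
double spaceThreshold(const TextStateInfo &st, double fallbackCoeff = 0.1) {
  if (st.spaceGlyphWidth > 0) {
    // The font tells us how wide a real space would be once the extra
    // character/word spacing is applied; half of that is a safe cut-off.
    double spaceAdvance = st.spaceGlyphWidth + st.charSpacing + st.wordSpacing;
    return 0.5 * spaceAdvance;
  }
  // No space glyph: fall back to a fraction of the font size, but never
  // go below that even when the character spacing is negative.
  return std::max(fallbackCoeff * st.fontSize,
                  st.charSpacing + fallbackCoeff * st.fontSize);
}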
>
> --josh
>
> On 3/12/12 3:45 PM, "Ihar `Philips` Filipau" <thephilips at gmail.com> wrote:
>
>>Hi All!
>>
>>Some time ago I encountered a document (it can be provided privately
>>via e-mail on request) which had a strange problem: the spacing
>>between some characters in words was uneven. Some sort of broken
>>kerning or some such.
>>
>>When I tried to convert the PDF into HTML/XML, I noticed that the
>>extra distance between characters was causing pretty much all PDF
>>conversion and reading tools to not recognize the words as single
>>words, but to split them into two or more words.
>>
>>I spent several weeks trying to salvage the document, and in the end
>>I succeeded. That work led me to try hacking on poppler to see if I
>>could make such a task easier. The result is a pretty simple patch
>>for pdftohtml, attached to the bug:
>>
>>https://bugs.freedesktop.org/show_bug.cgi?id=47022
>>
>>It introduces a command line option to adjust the normally
>>hard-coded coefficient of 0.1 used to detect when a word break
>>should occur.
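
To make tangible what that coefficient controls, here is a tiny
self-contained demo. The numbers are invented and the gap test is only
an approximation of what the output devices actually do:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Glyph { char c; double xMin, xMax, height; };

// Inserts a space whenever the gap to the previous glyph exceeds
// coeff * glyph height.
std::string join(const std::vector<Glyph> &g, double coeff) {
  std::string out;
  for (std::size_t i = 0; i < g.size(); ++i) {
    if (i > 0 && g[i].xMin - g[i - 1].xMax > coeff * g[i].height)
      out += ' ';
    out += g[i].c;
  }
  return out;
}

int main() {
  std::vector<Glyph> word = {   // "word" with an oddly wide gap after 'w'
    {'w',  0.0,  6.0, 10.0},
    {'o',  7.8, 12.0, 10.0},    // gap of 1.8 vs. a glyph height of 10
    {'r', 12.5, 16.0, 10.0},
    {'d', 16.5, 20.0, 10.0},
  };
  std::printf("%s\n", join(word, 0.1).c_str());   // "w ord" - split
  std::printf("%s\n", join(word, 0.3).c_str());   // "word"  - kept whole
  return 0;
}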
>>
>>On one side, the patch somehow doesn't fit the whole picture:
>>apparently pretty much all tools use the 0.1 coefficient for
>>breaking up words.
>>
>>On the other side, it would IMO be good to have at least one tool
>>capable of salvaging such documents. The alternative is lengthy,
>>tedious, menial proof-reading and editing.
>>
>>Does the community have any opinion on the topic in general or the
>>patch in particular?
>>
>>Thanks.
>>_______________________________________________
>>poppler mailing list
>>poppler at lists.freedesktop.org
>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>
>
>


-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

