[Poppler-bugs] [Bug 47022] pdftohtml: control over word breaks

Sun Mar 11 17:21:38 PDT 2012

https://bugs.freedesktop.org/show_bug.cgi?id=47022

--- Comment #3 from Ihar Filipau <thephilips at gmail.com> 2012-03-11 17:21:38 PDT ---
(In reply to comment #2)
> Question, does pdftotext extract the text of those pdf you need to "tweak"
> correctly?
> If it does instead of doing this hack should you try to use the same algorithm
> pdftotext uses?

No, it doesn't extract the text correctly.

pdftotext uses the same 0.1 coefficient. (See the minWordBreakSpace define in
TextOutputDev.cc.)

It seems pretty much everything else uses the same coefficient too. E.g. I
can't search for the split word in Adobe Reader, Foxit or Okular.

Worth repeating: the PDF I have is effectively broken, and repairing it
manually is impossible. With the switch I have added, `pdftohtml -xml
-wbt 30` repairs all of the split words. OK, it also incorrectly glued together
a few words, but in the main body of the book's text I couldn't find any
oddities anymore.
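
To illustrate the trade-off, here is a minimal sketch of a gap-based word
break rule. It is not the poppler code. It assumes that a new word starts when
the gap between two glyphs exceeds coefficient * font size, and that `-wbt 30`
is a percentage, i.e. a coefficient of 0.3. The function name and the glyph
positions are made up for the example.

```python
def split_words(glyphs, font_size, coefficient):
    """Group (char, x_start, x_end) glyphs into words.

    Assumed rule: a new word starts when the horizontal gap to the
    previous glyph is larger than coefficient * font_size.
    """
    words = []
    current = ""
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > coefficient * font_size:
            words.append(current)
            current = ""
        current += char
        prev_end = x_end
    if current:
        words.append(current)
    return words


# Hypothetical glyph positions for a 10 pt font:
#   "bro" and "ken" are one word with a stray 2 pt gap inside it,
#   a normal 4 pt space follows, then "it" and "is" are separated
#   by a narrow but genuine 2.5 pt space.
glyphs = [
    ("b", 0.0, 5.0), ("r", 5.0, 9.0), ("o", 9.0, 14.0),
    ("k", 16.0, 21.0), ("e", 21.0, 26.0), ("n", 26.0, 31.0),
    ("i", 35.0, 37.0), ("t", 37.0, 40.0),
    ("i", 42.5, 44.5), ("s", 44.5, 48.5),
]

print(split_words(glyphs, 10.0, 0.1))  # ['bro', 'ken', 'it', 'is']
print(split_words(glyphs, 10.0, 0.3))  # ['broken', 'itis']
```

With 0.1 the stray gap splits the word; with 0.3 the word is repaired, but the
narrow genuine space is lost as well, which is the gluing mentioned above.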
