[Poppler-bugs] [Bug 62266] [PATCH] try to detect line breaks in the PDF and insert them in raw mode for pdftotext
bugzilla-daemon at freedesktop.org
Mon Mar 25 15:19:49 PDT 2013
https://bugs.freedesktop.org/show_bug.cgi?id=62266
--- Comment #11 from Andrew Gallant <jamslam at gmail.com> ---
> It may not, but i don't see the need for your patch (you haven't made a case for it)
My patch is useful when one wants to capture groupings indicated by a
particular amount of vertical white space in raw mode from the PDF. Raw mode is
*already* capturing some kinds of vertical white space.
I've said this a couple of times now, but you don't seem to recognize it as me
having made a case. Perhaps you could tell me what you would need to be
convinced so that I can better make my case?
> In my opinion you are trying to use raworder for something that raworder is not supposed to do
I disagree. If that were so, then I'd be making assumptions about the text in
raw order that the code hasn't already made. But I'm not. The patch is a tweak on
existing logic that is already assuming some sort of reading order by looking
at letter spacing and intra-line spacing and using that information to affect
the output of raw mode. I propose to also look at inter-line spacing.
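To illustrate the idea only, here is a minimal Python sketch of treating inter-line spacing as a grouping signal. It is not the code in the patch and not Poppler's implementation: the `Line` type, the `join_with_group_breaks` function, and the `gap_factor` threshold of 1.5 are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for one line of text in content-stream order."""
    text: str
    baseline_y: float  # distance from the top of the page, in points
    font_size: float   # in points


def join_with_group_breaks(lines: List[Line], gap_factor: float = 1.5) -> str:
    """Join lines in the order given, adding a blank line where the vertical
    gap to the previous line exceeds gap_factor * previous font size.

    The threshold is an assumption made for illustration.
    """
    out: List[str] = []
    prev = None
    for line in lines:
        if prev is not None:
            gap = line.baseline_y - prev.baseline_y
            # A negative gap (text that jumps back up the page) never
            # triggers a break here; only a large downward gap does.
            if gap > gap_factor * prev.font_size:
                out.append("")
        out.append(line.text)
        prev = line
    return "\n".join(out)
```

With 10pt text, baselines 12pt apart stay in one group, while a 28pt jump starts a new group separated by an empty line. The order of the lines is never changed; only extra line breaks are emitted.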
> why are you using raw order instead of the real physical order?
Because I want to attempt to extract a linear text stream from a PDF in reading
order. Unless I am mistaken, raw mode seems best suited to do that. The new
option in the patch makes that raw text easier to consume in some cases (just
as inserting newlines based on intra-line spacing already makes it easier to
consume).