[poppler] Comparing geometric layout information across "pages"

Alec Taylor alec.taylor6 at gmail.com
Wed Oct 12 03:20:14 PDT 2011


Don't get me wrong, I know what they are; I'm just happy that the tool
supports them "out of the box" for PDFs.
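
A minimal sketch of how such bounding boxes could be compared across
pages to spot running headers, footers and page numbers. It does not
call Poppler: the Box record, the recurring_regions function and the
tolerance and min_share parameters are all hypothetical names made up
for illustration, and the boxes are assumed to have been extracted
already (for example via the getBBox functions mentioned below).

```python
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    """One text box on one page (hypothetical record, not a Poppler type)."""
    page: int
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def recurring_regions(boxes, page_count, tolerance=2.0, min_share=0.8):
    """Return (kind, boxes) groups whose position recurs on most pages."""
    groups = defaultdict(list)
    for box in boxes:
        # Snap the corner given by (x_min, y_min) to a coarse grid so that
        # small jitter between pages still lands in the same cell. Boxes
        # close to a cell boundary can still be split; this is only a sketch.
        key = (round(box.x_min / tolerance), round(box.y_min / tolerance))
        groups[key].append(box)

    found = []
    for members in groups.values():
        pages = {b.page for b in members}
        if len(pages) / page_count >= min_share:
            texts = {b.text for b in members}
            # Identical text at a fixed spot suggests a running header or
            # footer; varying text at a fixed spot suggests a page number.
            kind = "constant" if len(texts) == 1 else "varying"
            found.append((kind, sorted(members, key=lambda b: b.page)))
    return found
```

Keying on the (x_min, y_min) corner is the simplest choice; centred or
right-aligned page numbers shift that corner as the digit count grows,
so a real comparison would probably key on the vertical position plus
the box centre instead.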

On Wed, Oct 12, 2011 at 9:17 PM, Albert Astals Cid <aacid at kde.org> wrote:
> On Wednesday, 12 October 2011, Alec Taylor wrote:
>> I can get bounding boxes?
>
> /me points to the various getBBox functions in TextOutputDev.h or to the
> TextBox class in the Qt4 frontend
>
> Albert
>
>>
>> SOLD! - I'll start using your product now :]
>>
>> On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson <jric at chegg.com> wrote:
>> > Hmm.  MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
>> > sophisticated than Poppler.  If I found the right project, pdfdraw is no
>> > exception -- a very small piece of code that doesn't do any structure
>> > analysis; it looks like it just spits out whatever blobs are natively in
>> > the PDF.  If you find that I'm wrong about that, please let me know.
>> >
>> > If you start with Poppler, and my version of pdftohtml in particular,
>> > then you at least start out with a notion of words, lines of text, and
>> > paragraphs -- although these are not very accurate.  Each of
>> > those entities is tagged with font size and style.  You also get
>> > bounding boxes on all that text, as well as image objects (coalesced
>> > from multiple draw operations), which I use to find the page margins
>> > but which can be extended to find some of the other items you're
>> > interested in finding.
>> >
>> > Best, --josh
>> >
>> > On 10/11/11 9:08 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>> >>Thanks Josh, I was actually researching quite heavily, and found
>> >>myself on the #ghostscript channel @ freenode
>> >>
>> >>They pointed me to MuPDF (one of their projects), and it seems like
>> >>the "pdfdraw" example project is something to work from, either
>> >>directly or by parsing its XML output.
>> >>
>> >>However, if this doesn't suit your needs, please tell me why, as I
>> >>might have the same problem, and then I'll join forces! :]
>> >>
>> >>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <jric at chegg.com> wrote:
>> >>> Thanks for the pointer, Glad.
>> >>>
>> >>> FYI, I am also interested in being able to analyze document
>> >>> structure.  Our first step is to put the text back together, since
>> >>> in many PDFs it is not logically organized.  pdftohtml has a
>> >>> "coalesce" function which is the starting point for us.  We have
>> >>> made some improvements on it which are not yet contributed back --
>> >>> so let me know if you want the source and/or if you want to join
>> >>> forces.
>> >>>
>> >>> --josh
>> >>>
>> >>> On 10/11/11 12:31 AM, "Glad Deschrijver"
>> >>> <glad.deschrijver at gmail.com> wrote:
>> >>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>> >>>>> Good afternoon,
>> >>>>>
>> >>>>> Do you have any recommendations and/or sample code for comparing
>> >>>>> textual and geometric layout information across pages?
>> >>>>>
>> >>>>> Basically I'm trying to recognise patterns within documents, e.g.,
>> >>>>> page numbers, headers and footers, titles, column information,
>> >>>>> etc., using the capabilities of the Poppler PDF library.
>> >>>>
>> >>>>Not sure that it will help you much, but you can have a look at
>> >>>>DiffPDF, which uses poppler to compare two PDF files page by page
>> >>>>(both textually and visually):
>> >>>>http://www.qtrac.eu/diffpdf.html
>> >>>>
>> >>>>Best regards,
>> >>>>Glad
>> >>>>
>> >>>>--
>> >>>>
>> >>>> Everything that is really great and inspiring is created by
>> >>>> the individual who can labor in freedom.
>> >>>>      -- Albert Einstein, Out of My Later Years (1950)
>> >>>>
>> >>>>_______________________________________________
>> >>>>poppler mailing list
>> >>>>poppler at lists.freedesktop.org
>> >>>>http://lists.freedesktop.org/mailman/listinfo/poppler
>

