[poppler] Comparing geometric layout information across "pages"

Wed Oct 12 03:17:44 PDT 2011

On Wednesday, 12 October 2011, Alec Taylor wrote:
> I can get bounding boxes?

/me points to the various getBBox functions in TextOutputDev.h or to the 
TextBox class in the Qt4 frontend.
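
For the original question (patterns that repeat across pages), here is a
minimal sketch of what could be done once the boxes are extracted. It assumes
the bounding boxes have already been pulled out of Poppler and flattened into
plain (x_min, y_min, x_max, y_max) tuples per page; the function name
recurring_boxes, the tolerance and the sample coordinates are all hypothetical
and not part of any Poppler API.

```python
from collections import defaultdict


def recurring_boxes(pages, tolerance=2.0, min_fraction=0.8):
    """Return box positions that recur on at least min_fraction of the pages.

    pages is a list with one entry per page; each entry is a list of
    (x_min, y_min, x_max, y_max) tuples (assumed input format).
    """
    seen = defaultdict(set)  # quantised box -> set of page indices
    for page_index, boxes in enumerate(pages):
        for box in boxes:
            # Snap each coordinate to a grid so near-identical boxes match.
            key = tuple(round(v / tolerance) for v in box)
            seen[key].add(page_index)
    threshold = min_fraction * len(pages)
    return sorted(
        tuple(q * tolerance for q in key)
        for key, page_set in seen.items()
        if len(page_set) >= threshold
    )


# Hypothetical data: a running header near the top of three pages,
# plus body text whose extent differs from page to page.
pages = [
    [(72.0, 30.0, 200.0, 42.0), (72.0, 100.0, 500.0, 400.0)],
    [(72.4, 30.2, 199.8, 41.9), (72.0, 100.0, 480.0, 650.0)],
    [(71.8, 29.9, 200.3, 42.1), (72.0, 120.0, 510.0, 300.0)],
]
header_candidates = recurring_boxes(pages)
print(header_candidates)
```

Grid snapping is a deliberately crude matching rule: two boxes that straddle a
grid boundary will not match, so a real implementation would probably compare
box overlap instead.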

Albert

> 
> SOLD! - I'll start using your product now :]
> 
> On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson <jric at chegg.com> wrote:
> > Hmm.  MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
> > sophisticated than Poppler.  If I found the right project, pdfdraw is no
> > exception -- a very small piece of code that doesn't do any structure
> > analysis; it looks like it just spits out whatever blobs are natively in
> > the PDF.  If you find that I'm wrong about that, please let me know.
> > 
> > If you start with Poppler, and my version of pdftohtml in particular,
> > then you at least start out with a notion of words, lines of text, and
> > paragraphs -- albeit that these things are not very accurate.  Each of
> > those entities is tagged with font size and style.  You also get
> > bounding boxes on all that text, as well as image objects (coalesced
> > from multiple draw operations), which I use to find the page margins
> > but which can be extended to find some of the other items you're
> > interested in finding.
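
As an illustration of the margin idea described above, here is a minimal
sketch. It is not Josh's pdftohtml code; it assumes text and image boxes are
available as (x_min, y_min, x_max, y_max) tuples with the origin at the
top-left of the page and y growing downwards. The function name page_margins
and the sample values are hypothetical.

```python
def page_margins(boxes, page_width, page_height):
    """Return (left, top, right, bottom) margins from the union of all boxes."""
    if not boxes:
        raise ValueError("no boxes on page")
    # The union of every box is the area actually covered by content.
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    # Margins are the gaps between the content area and the page edges.
    return (left, top, page_width - right, page_height - bottom)


# Hypothetical US Letter page (612 x 792 points) with three content boxes.
boxes = [
    (72.0, 90.0, 540.0, 110.0),
    (72.0, 130.0, 500.0, 700.0),
    (300.0, 720.0, 312.0, 730.0),
]
margins = page_margins(boxes, 612.0, 792.0)
print(margins)
```

Comparing these tuples across pages is one way to spot pages whose layout
deviates from the rest of the document.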
> > 
> > Best, --josh
> > 
> > On 10/11/11 9:08 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
> >>Thanks Josh, I was actually researching quite heavily, and found
> >>myself on the #ghostscript channel @ freenode
> >>
> >>They pointed me to MuPDF (one of their projects), and it seems like
> >>the "pdfdraw" example project is something to work from, either
> >>directly; or through parsing XML output from it.
> >>
> >>However, if this doesn't suit your needs, please tell me why, as I
> >>might have the same problem, and then I'll join forces! :]
> >>
> >>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <jric at chegg.com> wrote:
> >>> Thanks for the pointer, Glad.
> >>> 
> >>> FYI, I am also interested in being able to analyze document structure.
> >>> Our first step is to put the text back together, since in many PDFs it
> >>> is not logically organized in the original PDF.  pdftohtml has a
> >>> "coalesce" function which is the starting point for us.  We have made
> >>> some improvements on it which are not yet contributed back -- so let me
> >>> know if you want the source and/or if you want to join forces.
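
To make "putting the text back together" concrete, here is a simplified
stand-in for that kind of coalescing step. It is not the actual "coalesce"
function nor the improved version mentioned above; it only groups word boxes
into lines by vertical position. The input format (text, x_min, y_min, x_max,
y_max), the function name coalesce_lines and the tolerance are assumptions
made for this sketch.

```python
def coalesce_lines(words, y_tolerance=3.0):
    """Group (text, x_min, y_min, x_max, y_max) words into lines of text."""
    lines = []  # each entry is [reference_y, [words on that line]]
    # Visit words top to bottom so lines come out in reading order.
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        for line in lines:
            if abs(line[0] - word[2]) <= y_tolerance:
                line[1].append(word)
                break
        else:
            # No existing line is close enough vertically: start a new one.
            lines.append([word[2], [word]])
    # Within each line, order words left to right and join their text.
    return [
        " ".join(w[0] for w in sorted(line_words, key=lambda w: w[1]))
        for _, line_words in lines
    ]


# Hypothetical word boxes, deliberately stored out of reading order.
words = [
    ("world", 110.0, 100.5, 150.0, 112.0),
    ("Hello", 72.0, 100.0, 105.0, 112.0),
    ("line", 120.0, 120.0, 150.0, 132.0),
    ("Second", 72.0, 119.5, 115.0, 132.0),
]
text_lines = coalesce_lines(words)
print(text_lines)
```

A sketch like this ignores columns, rotated text and font changes, which is
where most of the real work in structure analysis lies.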
> >>> 
> >>> --josh
> >>> 
> >>> On 10/11/11 12:31 AM, "Glad Deschrijver" <glad.deschrijver at gmail.com> wrote:
> >>>>On Tuesday 11 October 2011, Alec Taylor wrote:
> >>>>> Good afternoon,
> >>>>> 
> >>>>> Do you have some recommendations and/or sample code for comparing
> >>>>> textual and geometric layout information across pages?
> >>>>> 
> >>>>> Basically I'm trying to recognise patterns within documents (e.g. page
> >>>>> numbers, headers and footers, titles, column information, etc.)
> >>>>> using the capabilities of the Poppler PDF library.
> >>>>
> >>>>Not sure that it will help you much, but you can have a look at DiffPDF,
> >>>>which uses poppler to compare two PDF files page by page (both textually
> >>>>and visually):
> >>>>http://www.qtrac.eu/diffpdf.html
> >>>>
> >>>>Best regards,
> >>>>Glad
> >>>>
> >>>>--
> >>>>
> >>>> Everything that is really great and inspiring is created by
> >>>> the individual who can labor in freedom.
> >>>>      -- Albert Einstein, Out of My Later Years (1950)
> >>>>
> >>>>_______________________________________________
> >>>>poppler mailing list
> >>>>poppler at lists.freedesktop.org
> >>>>http://lists.freedesktop.org/mailman/listinfo/poppler
> 