[poppler] Comparing geometric layout information across "pages"

Alec Taylor alec.taylor6 at gmail.com
Wed Oct 12 03:03:42 PDT 2011


I can get bounding boxes?

SOLD! - I'll start using your product now :]
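
For anyone following along, here is a minimal sketch of how I imagine
using those boxes to get page margins, as Josh describes below. It
assumes XML shaped like what stock `pdftohtml -xml` emits (page elements
holding text elements with integer top, left, width and height
attributes); Josh's version may well differ. SAMPLE and page_margins are
names made up for illustration, not anything from Poppler.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the element and attribute names are an
# assumption modelled on `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1000" width="800">
<text top="50" left="100" width="600" height="20" font="0">A running header</text>
<text top="120" left="100" width="580" height="16" font="1">Body text line one</text>
<text top="900" left="390" width="20" height="16" font="1">1</text>
</page>
</pdf2xml>"""


def page_margins(page):
    """Return (left, top, right, bottom) margins for one <page> element.

    Returns None when the page holds no text boxes at all.
    """
    page_w = int(page.get("width"))
    page_h = int(page.get("height"))
    boxes = []
    for t in page.iter("text"):
        left = int(t.get("left"))
        top = int(t.get("top"))
        # Store each box as (x0, y0, x1, y1).
        boxes.append((left, top,
                      left + int(t.get("width")),
                      top + int(t.get("height"))))
    if not boxes:
        return None
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    # Margins are the gaps between the union of all boxes and the page edge.
    return (x0, y0, page_w - x1, page_h - y1)


root = ET.fromstring(SAMPLE)
for page in root.iter("page"):
    print(page.get("number"), page_margins(page))
```

This only looks at text boxes; image boxes would need to be folded into
the same min/max pass.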

On Wed, Oct 12, 2011 at 3:23 PM, Josh Richardson <jric at chegg.com> wrote:
> Hmm.  MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
> sophisticated than Poppler.  If I found the right project, pdfdraw is no
> exception -- a very small piece of code that doesn't do any structure
> analysis; it looks like it just spits out whatever blobs are natively in
> the PDF.  If you find that I'm wrong about that, please let me know.
>
> If you start with Poppler, and my version of pdftohtml in particular, then
> you at least start out with a notion of words, lines of text, and
> paragraphs -- albeit that these things are not very accurate.  Each of
> those entities is tagged with font size and style.  You also get bounding
> boxes on all that text, as well as image objects (coalesced from multiple
> draw operations), which I use to find the page margins, but which can be
> extended to find some of the other items you're interested in finding.
>
> Best, --josh
>
> On 10/11/11 9:08 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>
>>Thanks Josh, I was actually researching quite heavily, and found
>>myself on the #ghostscript channel @ freenode
>>
>>They pointed me to MuPDF (one of their projects), and it seems like
>>the "pdfdraw" example project is something to work from, either
>>directly; or through parsing XML output from it.
>>
>>However, if this doesn't suit your needs, please tell me why, as I
>>might have the same problem, and then I'll join forces! :]
>>
>>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <jric at chegg.com> wrote:
>>> Thanks for the pointer, Glad.
>>>
>>> FYI, I am also interested in being able to analyze document structure.
>>> Our first step is to put the text back together, since in many PDFs, it
>>> is not logically organized in the original PDF.  pdf2html has a "coalesce"
>>> function which is the starting point for us.  We have made some
>>> improvements on it which are not yet contributed back -- so let me know
>>> if you want the source and/or if you want to join forces.
>>>
>>> --josh
>>>
>>> On 10/11/11 12:31 AM, "Glad Deschrijver" <glad.deschrijver at gmail.com>
>>> wrote:
>>>
>>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>>>>> Good afternoon,
>>>>>
>>>>> Do you have any recommendations and/or sample code for comparing
>>>>> textual and geometric layout information across pages?
>>>>>
>>>>> Basically I'm trying to recognise patterns within documents, e.g., page
>>>>> numbers, headers and footers, titles, column information, etc., using
>>>>> the capabilities of the Poppler PDF library.
>>>>
>>>>Not sure that it will help you much, but you can have a look at DiffPDF,
>>>>which uses poppler to compare two PDF files page by page (both textually
>>>>and visually):
>>>>http://www.qtrac.eu/diffpdf.html
>>>>
>>>>Best regards,
>>>>Glad
>>>>
>>>>--
>>>> Everything that is really great and inspiring is created by
>>>> the individual who can labor in freedom.
>>>>      -- Albert Einstein, Out of My Later Years (1950)
>>>>
>>>>_______________________________________________
>>>>poppler mailing list
>>>>poppler at lists.freedesktop.org
>>>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>>>
>>>
>>>
>>
>
>
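
P.S. To make my original question at the bottom of this thread a bit more
concrete, here is a rough sketch of the kind of cross-page comparison I
have in mind for spotting headers, footers and page numbers. It assumes
each page has already been reduced to a list of (text, left, top, width,
height) tuples; repeated_regions, tolerance and min_share are hypothetical
names and the sample data is invented. Text that recurs at roughly the
same height on most pages is treated as a candidate, with digit runs
masked so that changing page numbers still match.

```python
import re
from collections import defaultdict


def repeated_regions(pages, tolerance=5, min_share=0.8):
    """Find text that recurs at about the same height on most pages.

    pages: list of pages, each a list of (text, left, top, width, height).
    Returns a list of (pattern, approx_top) sorted from top to bottom.
    """
    seen = defaultdict(set)  # (pattern, bucket) -> set of page indexes
    for idx, items in enumerate(pages):
        for text, left, top, width, height in items:
            # Mask digit runs so "1", "2", "3" all become "#".
            pattern = re.sub(r"\d+", "#", text.strip())
            # Coarse vertical bucket; boxes that straddle a bucket
            # boundary can be missed, which a real version should handle.
            bucket = round(top / tolerance)
            seen[(pattern, bucket)].add(idx)
    needed = min_share * len(pages)
    hits = [(pattern, bucket * tolerance)
            for (pattern, bucket), idxs in seen.items()
            if len(idxs) >= needed]
    return sorted(hits, key=lambda h: h[1])


# Invented three-page document: a running header, body text, page numbers.
pages = [
    [("My Book", 100, 40, 80, 12),
     ("Chapter one begins", 100, 120, 400, 14),
     ("1", 395, 950, 10, 12)],
    [("My Book", 100, 41, 80, 12),
     ("More body text", 100, 120, 400, 14),
     ("2", 395, 950, 10, 12)],
    [("My Book", 100, 40, 80, 12),
     ("Even more text", 100, 120, 400, 14),
     ("3", 395, 951, 10, 12)],
]

print(repeated_regions(pages))  # [('My Book', 40), ('#', 950)]
```

The horizontal position is ignored on purpose, since headers and page
numbers often alternate sides on odd and even pages.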

