[poppler] Comparing geometric layout information across "pages"

Josh Richardson jric at chegg.com
Tue Oct 11 21:23:42 PDT 2011


Hmm.  MuPDF, bless their hearts, is a cool bit of tech, but MUCH less
sophisticated than Poppler.  If I found the right project, pdfdraw is no
exception -- a very small piece of code that doesn't do any structure
analysis; it looks like it just spits out whatever blobs are natively in
the PDF.  If you find that I'm wrong about that, please let me know.

If you start with Poppler, and my version of pdftohtml in particular, then
you at least start out with a notion of words, lines of text, and
paragraphs, although none of these are very accurate.  Each of those
entities is tagged with font size and style.  You also get bounding boxes
on all that text, as well as image objects (coalesced from multiple draw
operations).  I use these to find the page margins, and the same approach
can be extended to find some of the other items you're interested in.
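
To make the bounding-box idea concrete, here is a rough Python sketch.  It
is not the pdftohtml code, and nothing in it is a Poppler API: the Box type,
the function names, and the tolerance/threshold defaults are all made up for
illustration.  It assumes you have already extracted one list of boxes per
page, in page coordinates with the origin at the top left.  The first
function derives margins from the union of the boxes; the second looks for
vertical positions that carry a box on most pages, which is one plausible
starting point for spotting headers, footers, and page numbers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, NamedTuple, Sequence


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; origin at the top-left of the page."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


class Margins(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def page_margins(boxes: Sequence[Box], page_width: float,
                 page_height: float) -> Margins:
    """Margins are the gaps between the page edges and the union of all boxes."""
    if not boxes:
        raise ValueError("cannot compute margins for a page with no boxes")
    return Margins(
        left=min(b.x_min for b in boxes),
        top=min(b.y_min for b in boxes),
        right=page_width - max(b.x_max for b in boxes),
        bottom=page_height - max(b.y_max for b in boxes),
    )


def recurring_tops(pages: Sequence[Sequence[Box]], tolerance: float = 2.0,
                   min_fraction: float = 0.8) -> List[float]:
    """Return top positions (snapped to `tolerance`) seen on most pages.

    Each position is counted at most once per page, so a page with many
    boxes on the same line does not inflate the count.
    """
    counts: Counter = Counter()
    for boxes in pages:
        counts.update({round(b.y_min / tolerance) for b in boxes})
    needed = min_fraction * len(pages)
    return sorted(k * tolerance for k, n in counts.items() if n >= needed)
```

Snapping to a fixed grid is the crudest possible way to compare positions:
two boxes that straddle a grid boundary will land in different buckets even
if they are closer together than the tolerance.  Real documents would
probably need proper clustering, plus a comparison of the text or font in
each recurring slot.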

Best, --josh

On 10/11/11 9:08 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:

>Thanks Josh, I was actually researching quite heavily, and found
>myself on the #ghostscript channel @ freenode
>
>They pointed me to MuPDF (one of their projects), and it seems like
>the "pdfdraw" example project is something to work from, either
>directly or by parsing its XML output.
>
>However, if this doesn't suit your needs, please tell me why, as I
>might have the same problem, and then I'll join forces! :]
>
>On Wed, Oct 12, 2011 at 3:44 AM, Josh Richardson <jric at chegg.com> wrote:
>> Thanks for the pointer, Glad.
>>
>> FYI, I am also interested in being able to analyze document structure.
>> Our first step is to put the text back together, since in many PDFs it
>> is not logically organized.  pdftohtml has a "coalesce" function which
>> is the starting point for us.  We have made some improvements on it
>> which are not yet contributed back -- so let me know if you want the
>> source and/or if you want to join forces.
>>
>> --josh
>>
>> On 10/11/11 12:31 AM, "Glad Deschrijver" <glad.deschrijver at gmail.com>
>> wrote:
>>
>>>On Tuesday 11 October 2011, Alec Taylor wrote:
>>>> Good afternoon,
>>>>
>>>> Do you have any recommendations and/or sample code for comparing
>>>> textual and geometric layout information across pages?
>>>>
>>>> Basically I'm trying to recognise patterns within documents, e.g., page
>>>> numbers, headers and footers, titles, column information, etc., using
>>>> the capabilities of the Poppler PDF library.
>>>
>>>Not sure that it will help you much, but you can have a look at DiffPDF,
>>>which uses poppler to compare two PDF files page by page (both textually
>>>and visually):
>>>http://www.qtrac.eu/diffpdf.html
>>>
>>>Best regards,
>>>Glad
>>>
>>>--
>>> Everything that is really great and inspiring is created by
>>> the individual who can labor in freedom.
>>>      -- Albert Einstein, Out of My Later Years (1950)
>>>
>>>_______________________________________________
>>>poppler mailing list
>>>poppler at lists.freedesktop.org
>>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>
>>
>


