[poppler] Analysing 3 pages from pdftohtml -xml at a time

Sun Oct 23 21:20:53 PDT 2011

Thanks Josh, that was my previous plan, but I thought it might be less
efficient than processing each page as it is generated.

But perhaps that's not the case...
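
Either way, the three-page window can be kept separate from where the
pages come from, so the same loop would work as a post-processing pass or
on pages as they are produced. Below is a rough sketch in Python, not
pdftohtml's own code. The names `page_windows` and `relation` are made up
for illustration, `relation` is only a placeholder for R, and the sample
XML is hand-written to approximate the shape of the -xml output.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption): roughly the shape of pdftohtml -xml
# output, trimmed to a few attributes.
SAMPLE = """<pdf2xml>
<page number="1"><text top="10" left="10">one</text></page>
<page number="2"><text top="10" left="10">two</text></page>
<page number="3"><text top="10" left="10">three</text></page>
<page number="4"><text top="10" left="10">four</text></page>
</pdf2xml>"""


def page_windows(pages, size=3):
    """Yield overlapping tuples of `size` consecutive items from `pages`.

    Only `size` pages are held at once, so `pages` may be any iterable,
    including a generator that produces pages one at a time.
    """
    window = deque(maxlen=size)  # the oldest page drops out automatically
    for page in pages:
        window.append(page)
        if len(window) == size:
            yield tuple(window)


def relation(p1, p2, p3):
    """Hypothetical stand-in for R: returns the three page numbers."""
    return tuple(p.get("number") for p in (p1, p2, p3))


root = ET.fromstring(SAMPLE)
# [p1] R [p2] R [p3], then [p2] R [p3] R [p4], until no pages are left.
trios = [relation(*w) for w in page_windows(root.iter("page"))]
```

A document with fewer than three pages yields no windows at all, so that
case would need separate handling if R has to run on short files too.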

On Mon, Oct 24, 2011 at 2:51 PM, Josh Richardson <jric at chegg.com> wrote:
> As I may have mentioned, you may wish to use the complex output from
> pdftohtml rather than the xml output option.  The complex output provides
> more functionality like creating image regions, calculating text bounding
> boxes and putting them into the xml, and font-aware text-coalescing.  It
> also produces valid XML (save one minor bug, which should be easy to fix.)
>
> In response to your particular question, I don't think there is any
> "in-memory" data-structure tracking the pages as they are created.  Since
> you're doing meta page-level processing, maybe you want to do it as a
> final optional stage of processing for pdftohtml, after it loops through
> and creates each page.  It could go back through and read in the DOMs it
> needs, run the calculations, and modify the XML files.
>
> Best, --josh
>
> On 10/22/11 10:04 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>
>>Good morning,
>>
>>I'm trying to figure out how to analyse 3 pages at a time, in memory,
>>from the pdftohtml -xml book.pdf stream (that is, before it is written
>>to the book.xml output file).
>>
>>The enhancement I'm implementing in pdftohtml uses an algorithm that
>>requires analysis of 3 pages at a time.
>>
>>[p1] R [p2] R [p3]
>>then
>>[p2] R [p3] R [p4]
>>continue till no pages are left
>>
>>(where 'R' refers to the relation I'm running on each page trio)
>>
>>How do I run this relation? - Preferably using some data-structure
>>(i.e. intermediary in-memory XML for analysis with libxml2 libraries)
>>
>>Thanks for all suggestions,
>>
>>Alec Taylor