[poppler] Analysing 3 pages from pdftohtml -xml at a time

Josh Richardson jric at chegg.com
Sun Oct 23 20:51:58 PDT 2011


As I may have mentioned, you may wish to use the complex output from
pdftohtml rather than the xml output option.  The complex output provides
more functionality, such as creating image regions, calculating text
bounding boxes and putting them into the XML, and font-aware text
coalescing.  It also produces valid XML (save one minor bug, which should
be easy to fix).

In response to your particular question, I don't think there is any
"in-memory" data structure tracking the pages as they are created.  Since
your processing spans several pages at once, maybe you want to do it as a
final, optional stage of processing for pdftohtml, after it has looped
through and created each page.  That stage could go back through, read in
the DOMs it needs, run the calculations, and modify the XML files.
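
A minimal sketch of such a final stage follows. It is written in Python
with the standard xml.etree module purely for illustration; pdftohtml itself
is C++, and the same loop could be written there with libxml2. It assumes
the finished XML has a root element whose direct children are <page>
elements containing <text> elements. The names page_trios, apply_relation
and count_neighbour_texts, and the attribute "neighbour-texts", are
hypothetical; count_neighbour_texts is only a placeholder for your real
relation R.

```python
import xml.etree.ElementTree as ET

def page_trios(pages):
    """Yield overlapping windows (p1, p2, p3), (p2, p3, p4), ..."""
    for i in range(len(pages) - 2):
        yield pages[i], pages[i + 1], pages[i + 2]

def apply_relation(root, relation):
    """Run relation(prev, cur, nxt) on every page trio; return the trio count."""
    pages = root.findall("page")  # assumption: pages are direct children of the root
    count = 0
    for prev, cur, nxt in page_trios(pages):
        relation(prev, cur, nxt)
        count += 1
    return count

def count_neighbour_texts(prev, cur, nxt):
    """Placeholder for R: store the neighbours' <text> count on the middle page."""
    total = len(prev.findall("text")) + len(nxt.findall("text"))
    cur.set("neighbour-texts", str(total))  # hypothetical attribute name

# Small stand-in document; a real run would parse the file pdftohtml wrote.
SAMPLE = """<pdf2xml>
  <page number="1"><text>a</text></page>
  <page number="2"><text>b</text><text>c</text></page>
  <page number="3"></page>
  <page number="4"><text>d</text><text>e</text><text>f</text></page>
</pdf2xml>"""

root = ET.fromstring(SAMPLE)
trio_count = apply_relation(root, count_neighbour_texts)
```

For a real file, ET.parse("book.xml") would load the document and
ElementTree.write() would save the modified tree. Documents with fewer
than three pages simply produce no trios.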

Best, --josh

On 10/22/11 10:04 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:

>Good morning,
>
>I'm trying to figure out how to analyse (in memory) 3 pages at a time
>from the pdftohtml -xml book.pdf stream, i.e. before it is written to
>the book.xml output file.
>
>Due to the enhancement I'm implementing onto pdftohtml, my algorithm
>requires analysis of 3 pages at a time.
>
>[p1] R [p2] R [p3]
>then
>[p2] R [p3] R [p4]
>continue till no pages are left
>
>(where 'R' refers to the relation I'm running on each page trio)
>
>How do I run this relation? Preferably using some data structure
>(e.g. an intermediary in-memory XML document analysed with libxml2).
>
>Thanks for all suggestions,
>
>Alec Taylor