[poppler] Analysing 3 pages from pdftohtml -xml at a time

Sun Oct 23 21:30:03 PDT 2011

It depends on exactly what you're doing and on your use case, but yes,
processing as you go could be more efficient.  From what I'm imagining,
though, the lost efficiency may be a worthwhile tradeoff for keeping the
program structure simpler.
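
To make the post-processing idea concrete, here is a rough sketch in
Python rather than in pdftohtml's own C++.  It is not poppler code.  It
assumes the XML output has a root element containing one <page> element
per page, each with a "number" attribute and <text> children, and it uses
a made-up relation (text that repeats on all three pages) as a stand-in
for 'R'.  The names page_trios, relation and SAMPLE are hypothetical.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Assumed shape of the XML output: one <page> per page, <text> children.
SAMPLE = """
<pdf2xml>
  <page number="1"><text>Book Title</text><text>Intro</text></page>
  <page number="2"><text>Book Title</text><text>Chapter 1</text></page>
  <page number="3"><text>Book Title</text><text>More text</text></page>
  <page number="4"><text>Appendix</text></page>
</pdf2xml>
"""


def page_trios(pages):
    """Yield overlapping windows (p1, p2, p3), (p2, p3, p4), ...

    Works on any iterable, so it can be fed a fully parsed document or
    pages that arrive one at a time.  Only three pages are held at once.
    """
    window = deque(maxlen=3)  # the oldest page drops out automatically
    for page in pages:
        window.append(page)
        if len(window) == 3:
            yield tuple(window)


def page_texts(page):
    """Return the set of stripped text strings found on one page."""
    return {"".join(t.itertext()).strip() for t in page.iter("text")}


def relation(prev_page, page, next_page):
    """Hypothetical stand-in for 'R': text present on all three pages."""
    common = page_texts(prev_page) & page_texts(page) & page_texts(next_page)
    return sorted(common)


root = ET.fromstring(SAMPLE)
for trio in page_trios(root.iter("page")):
    numbers = [p.get("number") for p in trio]
    print(numbers, relation(*trio))
```

Because the window is built from an iterator, the same generator would
serve either design: run it over the finished XML file as a final stage,
or feed it pages as they are produced if the extra efficiency turns out
to matter.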

Best, --josh

On 10/23/11 9:20 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:

>Thanks Josh, that was my previous plan, but I thought it might be less
>efficient than processing as it goes.
>
>But perhaps that's not the case...
>
>On Mon, Oct 24, 2011 at 2:51 PM, Josh Richardson <jric at chegg.com> wrote:
>> As I may have mentioned, you may wish to use the complex output from
>> pdftohtml rather than the xml output option.  The complex output
>> provides more functionality like creating image regions, calculating
>> text bounding boxes and putting them into the xml, and font-aware
>> text-coalescing.  It also produces valid XML (save one minor bug,
>> which should be easy to fix.)
>>
>> In response to your particular question, I don't think there is any
>> "in-memory" data-structure tracking the pages as they are created.
>> Since you're doing meta page-level processing, maybe you want to do it
>> as a final optional stage of processing for pdftohtml, after it loops
>> through and creates each page.  It could go back through and read in
>> the DOMs it needs, run the calculations, and modify the XML files.
>>
>> Best, --josh
>>
>> On 10/22/11 10:04 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>>
>>>Good morning,
>>>
>>>I'm trying to figure out how to analyse (in memory) 3 pages from the
>>>pdftohtml -xml book.pdf stream, (so before it is written to the
>>>book.xml output file).
>>>
>>>Due to the enhancement I'm implementing onto pdftohtml, my algorithm
>>>requires analysis of 3 pages at a time.
>>>
>>>[p1] R [p2] R [p3]
>>>then
>>>[p2] R [p3] R [p4]
>>>continue till no pages are left
>>>
>>>(where 'R' refers to the relation I'm running on each page trio)
>>>
>>>How do I run this relation? - Preferably using some data-structure
>>>(i.e. intermediary in-memory XML for analysis with libxml2 libraries)
>>>
>>>Thanks for all suggestions,
>>>
>>>Alec Taylor