[poppler] pdf to xml update

Leonard Rosenthol leonardr at pdfsages.com
Sun Aug 13 16:37:25 PDT 2006

At 11:31 AM 8/13/2006, Jauco Noordzij wrote:
>It parses the pages, aggregating text blocks in much the same way as
>the current TextOutputDev. It then chunks the page into a tree of
>nested 'splits', i.e. the page is split in two, then each part is
>split in two, and so on. This tree is then turned into blocks and
>paragraphs. The process is a bit hard to explain, but it works quite well.

         Sounds good.
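
The quoted description reads like a recursive whitespace split. Below is a minimal sketch of that general idea (in the style of a recursive XY-cut), not the code discussed in this thread. Box, SplitNode, widestGap, buildSplitTree and the minGap threshold are all hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical text block: an axis-aligned bounding box in page coordinates.
struct Box {
    double x0, y0, x1, y1;
};

// Node of the split tree. Leaves hold boxes; inner nodes hold two children.
struct SplitNode {
    std::vector<Box> boxes;  // filled only for leaves
    std::unique_ptr<SplitNode> first, second;
    bool isLeaf() const { return !first; }
};

// Widest empty gap between box projections along one axis.
struct Gap {
    double width = 0.0;
    double position = 0.0;  // coordinate in the middle of the gap
};

// Expects at least one box. vertical == true projects onto the y axis.
static Gap widestGap(std::vector<Box> boxes, bool vertical) {
    auto lo = [vertical](const Box &b) { return vertical ? b.y0 : b.x0; };
    auto hi = [vertical](const Box &b) { return vertical ? b.y1 : b.x1; };
    std::sort(boxes.begin(), boxes.end(),
              [&](const Box &a, const Box &b) { return lo(a) < lo(b); });
    Gap best;
    double covered = hi(boxes.front());  // far edge of the region seen so far
    for (std::size_t i = 1; i < boxes.size(); ++i) {
        double gap = lo(boxes[i]) - covered;
        if (gap > best.width) {
            best.width = gap;
            best.position = covered + gap / 2.0;
        }
        covered = std::max(covered, hi(boxes[i]));
    }
    return best;
}

// Split at the widest gap on either axis, then recurse into both halves.
// Recursion stops when no gap of at least minGap remains.
std::unique_ptr<SplitNode> buildSplitTree(const std::vector<Box> &boxes,
                                          double minGap) {
    auto node = std::make_unique<SplitNode>();
    if (boxes.size() < 2) {
        node->boxes = boxes;
        return node;
    }
    Gap gy = widestGap(boxes, true);
    Gap gx = widestGap(boxes, false);
    bool vertical = gy.width >= gx.width;
    Gap g = vertical ? gy : gx;
    if (g.width <= 0.0 || g.width < minGap) {
        node->boxes = boxes;  // nothing left to split: this is a leaf
        return node;
    }
    std::vector<Box> a, b;
    for (const Box &box : boxes) {
        double start = vertical ? box.y0 : box.x0;
        (start < g.position ? a : b).push_back(box);
    }
    node->first = buildSplitTree(a, minGap);
    node->second = buildSplitTree(b, minGap);
    return node;
}

// Number of leaves, i.e. candidate blocks, in the tree.
std::size_t countLeaves(const SplitNode &node) {
    if (node.isLeaf()) return 1;
    return countLeaves(*node.first) + countLeaves(*node.second);
}
```

In this sketch the leaves would become blocks, and the order of the splits gives a rough reading order. The threshold decides how much whitespace counts as a separator, so it would need tuning against real pages.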

         One thing I noticed from the blog is that it doesn't (yet)
support styling information.  You should be able to carry this along
easily, using a method similar to the TextWord class used by the current
TextOutputDev.
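
One way to picture carrying style along is to store the font with each word and merge neighbours that share it. The sketch below is self-contained and does not use poppler's own classes. StyledWord, StyledRun and groupByStyle are hypothetical names, and the font is reduced to a name and a size as a simplifying assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical per-word record: the text plus the style it was drawn with.
struct StyledWord {
    std::string text;
    std::string fontName;
    double fontSize;
};

// A run of consecutive words that share one style.
struct StyledRun {
    std::string text;
    std::string fontName;
    double fontSize;
};

// Merge adjacent words with identical font name and size into runs.
// Sizes are compared exactly, which assumes they come unmodified from the
// same source; a tolerance may be needed for computed sizes.
std::vector<StyledRun> groupByStyle(const std::vector<StyledWord> &words) {
    std::vector<StyledRun> runs;
    for (const StyledWord &w : words) {
        if (!runs.empty() && runs.back().fontName == w.fontName &&
            runs.back().fontSize == w.fontSize) {
            runs.back().text += ' ';
            runs.back().text += w.text;
        } else {
            runs.push_back({w.text, w.fontName, w.fontSize});
        }
    }
    return runs;
}
```

Each run could then be emitted as one styled element in the XML output, so the style survives the later grouping into blocks and paragraphs.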

>I also have a quick question: Is there a callback function for
>OutputDevs that gets called at the end of processing the PDF, like the
>one that's called at the end of the page?

         No, because in an interactive application, no such thing 
really exists.  The closest you can come is the destructor for the 
OutputDev.  I've put such things there before.

         Another option is a new method for your callers to use...


Leonard Rosenthol                            <mailto:leonardr at pdfsages.com>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                              215-938-0880 (fax)
