[poppler] pdf to xml update

Sun Aug 13 08:31:47 PDT 2006

Hi folks,

I'm still working on the pdf-to-structured-xml outputdev and have
published a first try at a patch at http://jauco.nl/blog/?p=27 . I
am wondering what you guys think of it :)

It parses the pages, aggregating text blocks in much the same way as
the current TextOutputDev. It then chunks the page into a tree of
nested 'splits', i.e. the page is split in two, then each of the two
parts is split in two, and so on. This tree is then turned into blocks
and paragraphs. The process is a bit hard to explain, but works quite
well. If anyone is really interested I suggest they search for
'recursive XY cut' on Google Scholar. The result is a tree that has the
text in reading order (even for quite complex layouts), and from that
tree the outputdev can recognise blocks of text and columns.
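
As a rough illustration of the recursive XY cut idea (this is not the
code from the patch, which works inside a poppler outputdev), here is a
minimal Python sketch. It assumes each text block is a bounding box
(x0, y0, x1, y1) with y growing downward, and it always cuts at the
widest empty gap on either axis. The names widest_gap, xy_cut and
reading_order, and the min_gap threshold, are made up for this example.

```python
# Minimal recursive XY cut sketch. Hypothetical names; not poppler code.
# A box is (x0, y0, x1, y1) with y growing downward.

def widest_gap(boxes, lo, hi):
    """Project boxes onto one axis and return (width, midpoint) of the
    widest empty gap between them, or (0.0, None) if there is none."""
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best = (0.0, None)
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best[0]:
            best = (start - reach, (start + reach) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes, min_gap=1.0):
    """Split the boxes recursively. A leaf is a list of boxes, an inner
    node is a 2-tuple (first, second) in reading order."""
    if len(boxes) <= 1:
        return list(boxes)
    y_width, y_cut = widest_gap(boxes, 1, 3)  # gap between rows
    x_width, x_cut = widest_gap(boxes, 0, 2)  # gap between columns
    best = max(x_width, y_width)
    if best <= 0 or best < min_gap:
        # No usable gap: keep these boxes together as one leaf.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if y_width >= x_width:
        lo, hi, cut = 1, 3, y_cut  # horizontal cut: top part first
    else:
        lo, hi, cut = 0, 2, x_cut  # vertical cut: left part first
    first = [b for b in boxes if b[hi] <= cut]
    second = [b for b in boxes if b[lo] >= cut]
    return (xy_cut(first, min_gap), xy_cut(second, min_gap))


def reading_order(tree):
    """Flatten the tree of splits into a list of boxes in reading order."""
    if isinstance(tree, tuple):
        return reading_order(tree[0]) + reading_order(tree[1])
    return list(tree)
```

Calling reading_order(xy_cut(boxes)) on a page with a full-width title
above two columns gives the title first, then the left column from top
to bottom, then the right column. Cutting at the widest gap is just one
possible heuristic; real layouts probably need smarter thresholds.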

I also have a quick question: Is there a callback function for
outputdevs that gets called at the end of processing the pdf, like the
one that's called at the end of the page? That would be a nice place
to do some multi-page analysis and add a function to convert the
whole structured tree to

--
Greetings,
Jauco Noordzij