[poppler] pdf to xml update

Kristian Høgsberg krh at bitplanet.net
Wed Sep 13 09:13:21 PDT 2006


On 8/13/06, Jauco Noordzij <jauco at jauco.nl> wrote:
> Hi folks,
>
> I'm still working on the pdf-to-structured-xml outputdev and I have
> published a first tryout of a patch at http://jauco.nl/blog/?p=27 . I
> am wondering what you guys think of it :)

Hi Jauco,

Sorry for the late reply, I just read through your mail and your blog
and it looks like you've done some great work.  From the screenshots
your text flow analysis looks really good, and my first thought was
that this will probably also be useful for text selection in pdf
viewers.  The text flow analysis in TextOutputDev.cc is easily
confused, which leads to weird behavior during selection, where the
selection will jump around and suddenly include unrelated blocks of
text from across the page
(https://bugs.freedesktop.org/show_bug.cgi?id=4006).

I'm thinking that your text flow analysis is a bit more robust and if
we could use this as the basis for text selection too, we'd have a
much better story there.  I don't know how much time you have to work
on this now, but if you could split the text flow analysis from the
abiword xml output, that would be great.  Ideally, we keep the flow
analysis in poppler core (i.e. in the poppler/ dir) and refactor the
code to build up a data structure that represents the text flow
(basically, just like TextOutputDev.cc does it).  Then the abiword
output tool just traverses this data structure and outputs the xml
document.  That way the libxml dependency also moves to the abiword
tool instead of making libpoppler depend on it.  Once that's in place,
I'd like to revisit the poppler selection code and see if I can make
it use your text flow analysis.
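
To make the proposed split concrete, here is a minimal sketch.  All of
the names in it (FlowLine, FlowBlock, FlowPage, escapeXml, writeXml)
are hypothetical and are not existing poppler classes, and plain
string streams stand in for libxml so the example stays
self-contained.  The structs represent what the core analysis might
produce; writeXml represents the output tool, which only traverses
that result.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Core side (hypothetical): the result of the text flow analysis.
// Nothing here knows about XML, so the library needs no XML dependency.
struct FlowLine {
  std::string text;              // text of one line
};

struct FlowBlock {
  std::vector<FlowLine> lines;   // lines of one block, top to bottom
};

struct FlowPage {
  std::vector<FlowBlock> blocks; // blocks in reading order
};

// Tool side (hypothetical): escape the characters that are special
// in XML character data.
std::string escapeXml(const std::string &s) {
  std::string out;
  for (char c : s) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;";  break;
      case '>': out += "&gt;";  break;
      default:  out += c;
    }
  }
  return out;
}

// Tool side (hypothetical): walk the flow structure and serialize it.
// A real tool would emit its own schema, probably through libxml.
std::string writeXml(const FlowPage &page) {
  std::ostringstream out;
  out << "<page>\n";
  for (const FlowBlock &block : page.blocks) {
    out << "  <block>\n";
    for (const FlowLine &line : block.lines) {
      out << "    <line>" << escapeXml(line.text) << "</line>\n";
    }
    out << "  </block>\n";
  }
  out << "</page>\n";
  return out.str();
}
```

Because the flow structure carries no output-specific state, selection
code could in principle walk the same FlowPage without touching the
XML writer at all.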

Thanks, and again, great work!
Kristian

