[poppler] pdf to xml update

Mon Oct 9 01:05:34 PDT 2006

> Hi Jauco,
>
> Sorry for the late reply, I just read through your mail and your blog
> and it looks like you've done some great work.  From the screenshots
> your text flow analysis looks really good, and my first thought was
> that this will probably also be useful for text selection in pdf
> viewers.  The text flow analysis in TextOutputDev.cc is easily
> confused, which leads to weird behavior during selection, where the
> selection will jump around and suddenly include unrelated blocks of
> text from across the page
> ( https://bugs.freedesktop.org/show_bug.cgi?id=4006).

hehehe, I'm not the fastest replier myself...  sorry about that, I had a few
weeks of extreme busyness combined with extreme tiredness/laziness. But I'm
ready to get rocking again :)

> I'm thinking that your text flow analysis is a bit more robust and if
> we could use this as the basis for text selection too, we'd have a
> much better story there.  I don't know how much time you have to work
> on this now, but if you could split the text flow analysis from the
> abi word xml output, that would be great.  Ideally, we keep the flow
> analysis in poppler core (i.e. in the poppler/ dir) and refactor the
> code to build up a data structure that represents the text flow
> (basically, just like TextOutputDev.cc does it).  Then the abiword
> output tool just traverses this data structure and outputs the xml
> document.  That way the libxml dependency also moves to the abiword
> tool instead of making libpoppler depend on it.  Once that's in place,
> I'd like to revisit the poppler selection code and see if I can make
> it use your text flow analysis.

I'm ok with dropping the dependency, but: My code works by constructing a
tree based on x,y coordinates and then interpreting this tree as a reading
order list of paragraphs. The construction of the tree is done in such a way
that a flattened tree will be in correct reading order. If you only want a
long string of text in correct order this might be doable without
constructing the tree. I would need to take a good look at how it is done
now to be sure.
Without the tree there will be no way to define paragraph endings and other
stuff I need for the structured text creation though. So that leaves two
possibilities: Writing code to maintain a tree with attributes inside
poppler or duplicating the code to the selection part and rewriting it there
to create a flat list. I'm not a great fan of writing my own code to
duplicate libxml functionality: I'll doubtless introduce new bugs, and I
have to serialise to xml eventually anyway.
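
A minimal sketch of the tree idea described above, in Python for brevity
rather than poppler's C++. It is an illustration under assumptions, not the
actual code: it uses a simple recursive XY-cut (split at the widest empty gap,
columns first, then rows) as a stand-in for however the real tree is built,
and the names Block, Node, build_tree, flatten and min_gap are all
hypothetical. Coordinates are assumed to have y growing downward.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Block:
    """A rectangle of text on the page (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Node:
    """Tree node: leaves carry a block, inner nodes carry ordered children."""
    children: List["Node"] = field(default_factory=list)
    block: Optional[Block] = None


def _split(blocks: List[Block],
           lo: Callable[[Block], float],
           hi: Callable[[Block], float],
           min_gap: float) -> Optional[Tuple[List[Block], List[Block]]]:
    """Cut at the widest empty gap along one axis, or return None."""
    ordered = sorted(blocks, key=lo)
    best_gap, best_i = 0.0, None
    reach = hi(ordered[0])  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = lo(ordered[i]) - reach
        if gap >= min_gap and gap > best_gap:
            best_gap, best_i = gap, i
        reach = max(reach, hi(ordered[i]))
    if best_i is None:
        return None
    return ordered[:best_i], ordered[best_i:]


def build_tree(blocks: List[Block], min_gap: float = 10.0) -> Node:
    """Build a tree whose depth-first traversal is the reading order."""
    if not blocks:
        raise ValueError("need at least one block")
    if len(blocks) == 1:
        return Node(block=blocks[0])
    # Try a vertical cut (columns) first, then a horizontal cut (rows).
    parts = (_split(blocks, lambda b: b.x0, lambda b: b.x1, min_gap)
             or _split(blocks, lambda b: b.y0, lambda b: b.y1, min_gap))
    if parts is None:
        # No clear gap: fall back to top-to-bottom, left-to-right leaves.
        ordered = sorted(blocks, key=lambda b: (b.y0, b.x0))
        return Node(children=[Node(block=b) for b in ordered])
    return Node(children=[build_tree(p, min_gap) for p in parts])


def flatten(node: Node) -> List[str]:
    """Depth-first walk: the flat list a selection code path could use."""
    if node.block is not None:
        return [node.block.text]
    out: List[str] = []
    for child in node.children:
        out.extend(flatten(child))
    return out
```

The point of the sketch is the trade-off: flatten() is all a selection code
path would need, while a structured-text writer would walk the Node tree
itself, because the grouping (which blocks share a parent) is exactly the
information that is lost once the tree is flattened.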

Anyway, great you like it! I'm finishing my internship ATM and doing some
other assignments but I'm determined to get this code fixed for inclusion
into poppler. Getting the text selection fixed would be scratching a major
itch as well. (I need to copy-paste from PDFs a _lot_ :) So let me know which
direction you think is the best for poppler as a whole.

-- 
greetings,
     Jauco Noordzij