[poppler] pdf to xml update

Wed Sep 14 20:12:45 PDT 2011

Jauco Noordzij <jauco <at> jauco.nl> writes:

> Hi Jauco,
> 
> Sorry for the late reply, I just read through your mail and your blog
> and it looks like you've done some great work.  From the screenshots your
> text flow analysis looks really good, and my first thought was that this
> will probably also be useful for text selection in pdf viewers.  The text
> flow analysis in TextOutputDev.cc is easily confused, which leads to weird
> behavior during selection, where the selection will jump around and
> suddenly include unrelated blocks of text from across the page
> (https://bugs.freedesktop.org/show_bug.cgi?id=4006).
> 
> hehehe,
> I'm not the fastest replier myself...  sorry about that, I had a few
> weeks of extreme busyness combined with extreme tiredness/laziness. But
> I'm ready to get rocking again :)
> 
> 
> I'm thinking that your text flow analysis is a bit more robust and if
> we could use this as the basis for text selection too, we'd have a
> much better story there.  I don't know how much time you have to work on
> this now, but if you could split the text flow analysis from the abiword
> xml output, that would be great.  Ideally, we keep the flow analysis in
> poppler core (i.e. in the poppler/ dir) and refactor the code to build up
> a data structure that represents the text flow (basically, just like
> TextOutputDev.cc does it).  Then the abiword output tool just traverses
> this data structure and outputs the xml document.  That way the libxml
> dependency also moves to the abiword tool instead of making libpoppler
> depend on it.  Once that's in place, I'd like to revisit the poppler
> selection code and see if I can make it use your text flow analysis.
> 
> I'm ok with dropping the dependency, but: My code works by constructing a
> tree based on x,y coordinates and then interpreting this tree as a
> reading order list of paragraphs. The construction of the tree is done
> in such a way that a flattened tree will be in correct reading order.
> If you only want a long string of text in correct order this might be
> doable without constructing the tree. I would need to take a good look
> at how it is done now to be sure.
> 
> Without the tree there will be no way to define paragraph endings
> and other stuff I need for the structured text creation though. So that
> leaves two possibilities: writing code to maintain a tree with
> attributes inside poppler, or duplicating the code to the selection part
> and rewriting it there to create a flat list. I'm not a great fan of
> writing my own code to duplicate libxml functionality: I'll doubtless
> introduce new bugs and I have to serialise to xml eventually anyway.
> 
> Anyway, great you like it! I'm finishing my internship ATM
> and doing some other assignments but I'm determined to get this code
> fixed for inclusion into poppler. Getting the text selection fixed
> would be scratching a major itch as well. (I need to copy-paste from
> PDFs a _lot_ :) So let me know which direction you think is the best for
> poppler as a whole.
> 
> -- greetings,     Jauco Noordzij
> 
> _______________________________________________
> poppler mailing list
> poppler <at> lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler
> 
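To make the tree idea from the quoted mail a bit more concrete, here is a minimal sketch of a tree whose flattening gives the reading order while still keeping paragraph boundaries. It is only an illustration: `FlowNode` and `flatten` are hypothetical names, not poppler's or Jauco's actual code, and it assumes the children of each node were already sorted by position (for example columns left to right, blocks top to bottom) when the tree was built.

```cpp
#include <string>
#include <vector>

// Hypothetical node type, for illustration only.
// Leaves hold one paragraph of text; inner nodes hold children that are
// assumed to have been ordered by their x,y position during construction.
struct FlowNode {
    std::string paragraph;           // only meaningful for leaves
    std::vector<FlowNode> children;  // empty for leaves
};

// Depth-first, left-to-right walk. Because the children are stored in
// reading order, the leaves come out in reading order, one entry per
// paragraph, so paragraph endings are not lost.
void flatten(const FlowNode &node, std::vector<std::string> &out) {
    if (node.children.empty()) {
        out.push_back(node.paragraph);
        return;
    }
    for (const FlowNode &child : node.children) {
        flatten(child, out);
    }
}
```

A selection implementation that only needs a flat list could call `flatten` once, while an xml writer could walk the same tree and emit one element per node.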

Hi all,

I have the same problem: I need to separate the text of different
paragraphs.  So far I have modified the code in GFX so that it creates a
paragraph from all the text (text is defined by the commands TJ and Tj)
between BT and ET.  So far the code works very well; as I said before, it is
deterministic, so we can always be sure that we get the correct paragraph.

BUT I have a problem: when the code uses Unicode characters I cannot read
the text :-(  I'm tracing the code to see if I can get access to the Unicode
characters in 8 bits, but I'm having problems in that part.  If somebody can
please tell me whether there is a way to translate the Unicode characters to
plain English, I can share my code to extract paragraphs.

thanks
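
One possible reading of "Unicode characters in 8 bits" is encoding each code point as UTF-8 bytes. The sketch below shows that conversion under the assumption that the characters are already available as 32-bit code points; `appendUtf8` and `toUtf8` are hypothetical helper names, not poppler functions, and how the code points are obtained from the font's mapping is not covered here.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of poppler: appends the UTF-8 encoding of
// one Unicode code point to `out`. Values outside the Unicode range and
// surrogate halves are replaced with U+FFFD.
void appendUtf8(std::uint32_t cp, std::string &out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        cp = 0xFFFD;  // replacement character
    }
    if (cp < 0x80) {  // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {  // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {  // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a whole run of code points (e.g. one paragraph) to UTF-8.
std::string toUtf8(const std::vector<std::uint32_t> &codePoints) {
    std::string out;
    for (std::uint32_t cp : codePoints) {
        appendUtf8(cp, out);
    }
    return out;
}
```

Text that is plain English comes out unchanged, one byte per character; everything else becomes a multi-byte sequence that a UTF-8 aware terminal or editor can display.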