[poppler] pdf to xml update

Josh Richardson jric at chegg.com
Wed Sep 14 20:30:18 PDT 2011


1. I'd like to point out that pdftohtml also has a "coalescence" function
which attempts to make paragraphs out of PDF, but is so far very
rudimentary and inaccurate, and could definitely benefit from some good
algorithmic sauce.  Perhaps we could figure out how to create functions at
the poppler library level to be leveraged across applications.  I'd be
happy to contribute.
2. Dave, why do you say that you cannot read Unicode, and that you want
8-bit plain English?  Unicode is great for describing English, as well as
every other human language.  ASCII is encoded in 7 bits, and once you get
into that eighth bit, you'd better know what the encoding is, otherwise you
may misinterpret the meaning.  What exactly is the problem you're facing?
For pdftohtml, we found that many documents were encoded with glyphs from
embedded fonts that had no Unicode mapping.  If you need to be able to
interpret that text without reference to the embedded font, then I think
you'll have to do pattern-matching on the rendered glyphs.  Not something
I'm planning to undertake, but sounds like fun!
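
To illustrate point 2, here is a minimal Python sketch (not poppler code) of
why the eighth bit is ambiguous, and of one lossy way to fold Unicode text
down to ASCII.  The sample bytes and the helper name to_ascii are made up for
illustration; NFKD folding is only one possible approach, and it does nothing
for glyphs that have no Unicode mapping in the first place.

```python
import unicodedata

# Four bytes: "caf" in ASCII plus 0xE9, which has the eighth bit set.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

# The 7-bit part is unambiguous, but 0xE9 depends on the assumed encoding.
as_latin1 = raw.decode("latin-1")  # 0xE9 is U+00E9 (e with acute) in Latin-1
as_cp437 = raw.decode("cp437")     # the same byte maps to a different character

# Strict ASCII rejects the byte outright rather than guessing.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False


def to_ascii(text: str) -> str:
    """Hypothetical helper: fold Unicode text to ASCII, dropping the rest."""
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility characters such as the "fi" ligature.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then discards everything outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(repr(as_latin1), repr(as_cp437), ascii_ok)
    print(to_ascii("caf\u00e9 \ufb01le"))
```

The folding step throws information away, so it only makes sense when the
consumer really cannot handle anything beyond ASCII.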

--josh

On 9/14/11 8:12 PM, "Dave" <ldlbad at hotmail.com> wrote:

>
>Jauco Noordzij <jauco <at> jauco.nl> writes:
>
>>
>> Hi Jauco,
>>
>> Sorry for the late reply. I just read through your mail and your blog,
>> and it looks like you've done some great work.  From the screenshots,
>> your text flow analysis looks really good, and my first thought was
>> that this will probably also be useful for text selection in PDF
>> viewers.  The text flow analysis in TextOutputDev.cc is easily
>> confused, which leads to weird behavior during selection, where the
>> selection will jump around and suddenly include unrelated blocks of
>> text from across the page
>> (https://bugs.freedesktop.org/show_bug.cgi?id=4006).
>>
>> hehehe, I'm not the fastest replier myself...  sorry about that, I had
>> a few weeks of extreme busyness combined with extreme
>> tiredness/laziness. But I'm ready to get rocking again :)
>>
>> I'm thinking that your text flow analysis is a bit more robust and if
>> we could use this as the basis for text selection too, we'd have a
>> much better story there.  I don't know how much time you have to work
>> on this now, but if you could split the text flow analysis from the
>> AbiWord XML output, that would be great.  Ideally, we keep the flow
>> analysis in poppler core (i.e. in the poppler/ dir) and refactor the
>> code to build up a data structure that represents the text flow
>> (basically, just like TextOutputDev.cc does it).  Then the abiword
>> output tool just traverses this data structure and outputs the XML
>> document.  That way the libxml dependency also moves to the abiword
>> tool instead of making libpoppler depend on it.  Once that's in place,
>> I'd like to revisit the poppler selection code and see if I can make
>> it use your text flow analysis.
>>
>> I'm ok with dropping the dependency, but: my code works by constructing a
>> tree based on x,y coordinates and then interpreting this tree as a
>> reading order list of paragraphs. The construction of the tree is done
>> in such a way that a flattened tree will be in correct reading order.
>> If you only want a long string of text in correct order this might be
>> doable without constructing the tree. I would need to take a good look
>> at how it is done now to be sure.
>>
>> Without the tree there will be no way to define paragraph endings
>> and other stuff I need for the structured text creation though. So that
>> leaves two possibilities: writing code to maintain a tree with
>> attributes inside poppler, or duplicating the code to the selection part
>> and rewriting it there to create a flat list. I'm not a great fan of
>> writing my own code to duplicate libxml functionality; I'll doubtless
>> introduce new bugs and I have to serialise to XML eventually anyway.
>>
>> Anyway, great you like it! I'm finishing my internship ATM
>> and doing some other assignments but I'm determined to get this code
>> fixed for inclusion into poppler. Getting the text selection fixed
>> would be scratching a major itch as well. (I need to copy-paste from
>> PDFs a _lot_ :) So let me know which direction you think is the best for
>> poppler as a whole.
>>
>> -- greetings,     Jauco Noordzij
>>
>> _______________________________________________
>> poppler mailing list
>> poppler <at> lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/poppler
>> 
>
>
>Hi All,
>
>I also have the same problem: I need to separate the text from different
>paragraphs. So far I have modified the code in GFX: I create paragraphs
>from all the text (text is defined by the commands TJ and Tj) between BT
>and ET. So far the code works very well. As I said before, it's
>deterministic, so we can always be sure that we get the correct
>paragraph, BUT I have a problem: when the code uses Unicode characters
>then I cannot read the text :-( I'm tracing the code to see if I can
>have access to the Unicode characters in 8 bits, but I'm having
>problems in that part. If somebody can please tell me if there is a way
>to translate the Unicode characters to plain English, I can share my
>code to extract paragraphs.
>
>thanks
>
>_______________________________________________
>poppler mailing list
>poppler at lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/poppler
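
For reference, the BT/ET grouping Dave describes in the quoted message can be
sketched in a few lines of Python.  This is not his patch and not poppler
code: it is a toy scanner over an already-decompressed content stream, and
the names TOKEN, unescape and paragraphs_from_stream are invented here.  It
also assumes string bytes are Latin-1, which is exactly the assumption that
breaks once a font uses its own encoding.

```python
import re

# Toy tokenizer for an uncompressed content stream. It recognises literal
# strings "(...)" with backslash escapes, array brackets, and bare tokens.
# Nested unescaped parentheses and hex strings are not handled.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+")


def unescape(literal: bytes) -> str:
    """Drop the outer parentheses; resolve escaped parentheses and backslashes."""
    body = re.sub(rb"\\([()\\])", rb"\1", literal[1:-1])
    # Assumption: bytes are Latin-1 text. Real PDFs need the font's encoding.
    return body.decode("latin-1")


def paragraphs_from_stream(stream: bytes) -> list[str]:
    """Join the strings shown by Tj/TJ inside each BT ... ET text object."""
    paragraphs = []
    current = None  # list of shown strings, or None while outside BT ... ET
    pending = []    # string operands waiting for their operator
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok == b"BT":
            current, pending = [], []
        elif tok == b"ET":
            if current:
                paragraphs.append("".join(current))
            current = None
        elif tok.startswith(b"("):
            pending.append(unescape(tok))
        elif tok in (b"Tj", b"TJ"):
            if current is not None:
                current.extend(pending)
            pending = []
        elif tok.isalpha():
            # Any other alphabetic operator consumes its operands unseen.
            pending = []
        # Numbers, names such as /F1, and brackets are skipped.
    return paragraphs


if __name__ == "__main__":
    demo = (
        rb"BT /F1 12 Tf 72 700 Td (First ) Tj [(para) -20 (graph)] TJ ET "
        rb"BT 72 680 Td (Second \(one\)) Tj ET"
    )
    for paragraph in paragraphs_from_stream(demo):
        print(paragraph)
```

Treating each BT ... ET object as one paragraph is deterministic, as Dave
says, but it depends on how the producing application chose to split its
text objects.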


