[poppler] pdf to xml update

Thu Sep 15 08:43:08 PDT 2011

Josh Richardson <jric <at> chegg.com> writes:

> 
> 1. I'd like to point out that pdftohtml also has a "coalescence" function
> which attempts to make paragraphs out of PDF, but is so far very
> rudimentary and inaccurate, and could definitely benefit from some good
> algorithmic sauce.  Perhaps we could figure out how to create functions at
> the poppler library level to be leveraged across applications.  I'd be
> happy to contribute.
> 2. Dave, why do you say that you cannot read unicode, and you want 8-bit
> in plain English?  Unicode is great for describing English, as well as
> every other human language.  ASCII is encoded in 7 bits, and once you get
> into that eighth bit, you better know what the encoding is, otherwise you
> may misinterpret the meaning.  What exactly is the problem you're facing?
> For pdftohtml, we found that many documents were encoded with glyphs from
> embedded fonts that had no unicode mapping.  If you need to be able to
> interpret that text without reference to the embedded font, then I think
> you'll have to do pattern-matching on the rendered glyph.  Not something
> I'm planning to undertake, but sounds like fun!
> 
> --josh
> 

Hi Josh, thanks for your reply.

  In the Gfx file I read the content-stream commands, so I have direct access
to the character strings from those commands: the text is the operand of TJ or
Tj. Since all the pieces of text from the same paragraph are always between BT
(begin text) and ET (end text), I can extract the whole paragraph correctly, so
I don't need any guessing or more complex processing.

  The problem with this approach is that sometimes, instead of letters, I get
garbage (each character prints like a small 2x2 table of numbers). If, instead
of extracting the text from the commands, I extract it just before rendering
(which is what most people do), I can actually read the string of characters.
So my question is: which piece of code does that translation? I'm not sure
where it is.

  So far I have also written some heuristics to separate paragraphs. They work
most of the time, but not always. I think that if I can find a way to translate
those codes, I will have something that works all the time.
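
For illustration, here is a minimal Python sketch of the kind of translation
described above. It is not poppler code. It assumes the strings are hex-encoded
2-byte character codes and that the font supplies a code-to-Unicode table (the
role a ToUnicode CMap plays in a PDF). The table TO_UNICODE, the sample stream
SAMPLE, and both function names are hypothetical. Grouping by BT/ET follows the
assumption in this message; the PDF format itself does not promise that one
BT/ET block equals one paragraph.

```python
import re

# Hypothetical code-to-Unicode table standing in for a font's ToUnicode CMap.
# Real fonts define their own mapping; these values are made up.
TO_UNICODE = {0x0001: "H", 0x0002: "e", 0x0003: "l", 0x0004: "o"}

# Hypothetical content-stream fragment with one Tj and one TJ operator.
SAMPLE = "BT /F1 12 Tf <00010002> Tj [<00030003> -20 <0004>] TJ ET"


def decode_codes(raw, to_unicode, code_size=2):
    """Map fixed-width character codes to Unicode text.

    Codes missing from the table become U+FFFD, which is roughly the case
    where no Unicode mapping exists for a glyph.
    """
    out = []
    for i in range(0, len(raw) - code_size + 1, code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        out.append(to_unicode.get(code, "\ufffd"))
    return "".join(out)


def extract_paragraphs(content, to_unicode):
    """Return one decoded string per BT ... ET block.

    Simplification: every <hex> string inside a block is treated as a text
    operand of Tj or TJ; literal (...) strings are ignored.
    """
    paragraphs = []
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content, flags=re.S):
        pieces = []
        for hex_str in re.findall(r"<([0-9A-Fa-f]+)>", block):
            if len(hex_str) % 2:
                hex_str += "0"  # PDF pads an odd-length hex string with 0
            pieces.append(decode_codes(bytes.fromhex(hex_str), to_unicode))
        paragraphs.append("".join(pieces))
    return paragraphs


if __name__ == "__main__":
    # Printing the raw bytes directly gives control characters, not letters.
    print(repr(bytes.fromhex("00010002").decode("latin-1")))
    print(extract_paragraphs(SAMPLE, TO_UNICODE))
```

Printed without the table lookup, the operand bytes are just character codes,
which may be why they show up as boxes with numbers. In poppler the equivalent
lookup appears to happen in GfxFont::getNextChar, which Gfx::doShowText calls
for each string operand; that pointer is worth verifying against the source.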