[poppler] pdf to xml update

Josh Richardson jric at chegg.com
Thu Sep 15 10:56:19 PDT 2011


When the font defines the unicode mapping, I think you can use these
functions:
 GfxFont::hasToUnicodeCMap()
 GfxFont::getToUnicode()
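
To make concrete what such a mapping does, here is a minimal Python sketch. It is not Poppler code: the table contents and the name decode_string are invented for illustration. The point is that the bytes passed to Tj/TJ are character codes, and they only become readable once they are looked up in the font's code-to-Unicode table.

```python
# Hypothetical code-to-Unicode table for one font. A real table would be
# built from the font's ToUnicode CMap; these entries are made up.
TO_UNICODE = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def decode_string(codes, to_unicode):
    """Map each 1-byte character code to its Unicode string.

    Codes without an entry become U+FFFD (the replacement character),
    which is roughly what shows up as "weird stuff" when raw codes are
    printed without any translation.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# The operand of a Tj operator, as raw character codes.
raw_operand = b"\x01\x02\x03\x03\x04"
print(decode_string(raw_operand, TO_UNICODE))
```

Fonts that use multi-byte codes would need the code space ranges from the CMap to split the byte string first; this sketch assumes one byte per code.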

I recall that Poppler does some smarter things to try and guess the
unicode mapping from the glyph name when the mapping is not provided.  It
implements an Adobe spec, as well as doing some of its own conjuring.
Look into the GfxFont implementations, e.g. Gfx8BitFont.
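
For a rough idea of what guessing from the glyph name looks like, here is a Python sketch of the name-parsing rules as I understand them from the Adobe Glyph List specification. It is an assumption-laden illustration, not Poppler's implementation: AGL_SUBSET is a tiny hand-picked excerpt of the real glyph list, and glyph_name_to_unicode is a made-up name.

```python
import re

# Tiny excerpt of the Adobe Glyph List (the real list has thousands of entries).
AGL_SUBSET = {
    "space": "\u0020",
    "A": "\u0041",
    "f": "\u0066",
    "i": "\u0069",
    "eacute": "\u00e9",
    "fi": "\ufb01",
}


def _valid_scalar(value):
    """True for code points that are in range and not surrogates."""
    return value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


def glyph_name_to_unicode(name):
    """Guess a Unicode string from a glyph name; "" if nothing matches."""
    base = name.split(".", 1)[0]      # drop a suffix such as ".alt"
    out = []
    for part in base.split("_"):      # "f_i" names a ligature of f and i
        if part in AGL_SUBSET:
            out.append(AGL_SUBSET[part])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]         # "uniXXXX...": groups of 4 hex digits
            values = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
            if all(_valid_scalar(v) for v in values):
                out.extend(chr(v) for v in values)
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            value = int(part[1:], 16)  # "uXXXX" to "uXXXXXX": one code point
            if _valid_scalar(value):
                out.append(chr(value))
        # any other component contributes nothing
    return "".join(out)


print(glyph_name_to_unicode("f_i.alt"))
```

Subset fonts often use meaningless names such as "g23", for which rules like these return nothing; that is the case where only the rendered glyph shape is left to go on.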

--josh

On 9/15/11 8:43 AM, "Dave" <ldlbad at hotmail.com> wrote:

>Josh Richardson <jric <at> chegg.com> writes:
>
>> 
>> 1. I'd like to point out that pdftohtml also has a "coalescence"
>> function which attempts to make paragraphs out of PDF, but is so far
>> very rudimentary and inaccurate, and could definitely benefit from some
>> good algorithmic sauce.  Perhaps we could figure out how to create
>> functions at the poppler library level to be leveraged across
>> applications.  I'd be happy to contribute.
>> 2. Dave, why do you say that you cannot read unicode, and you want 8-bit
>> in plain English?  Unicode is great for describing English, as well as
>> every other human language.  ASCII is encoded in 7 bits, and once you
>> get into that eighth bit, you better know what the encoding is,
>> otherwise you may misinterpret the meaning.  What exactly is the problem
>> you're facing?  For pdftohtml, we found that many documents were encoded
>> with glyphs from embedded fonts that had no unicode mapping.  If you
>> need to be able to interpret that text without reference to the
>> embedded font, then I think you'll have to do pattern-matching on the
>> rendered glyph.  Not something I'm planning to undertake, but sounds
>> like fun!
>> 
>> --josh
>> 
>
>Hi Josh, thanks for your reply.
>
>In the file Gfx I read the commands, and I have access to the string of
>characters directly from those commands: the text is a parameter of TJ or
>Tj. Since all the pieces of text from the same paragraph are always
>between BT (begin text) and ET (end text), I can correctly extract the
>whole paragraph, so I don't need to make any guesses or run a more
>complex process. The problem with this approach is that sometimes,
>instead of letters, I get some weird stuff (it prints like a 2x2 table
>with numbers). But if, instead of extracting the text from the commands,
>I extract it before rendering (which is what most people do), I can
>actually read the string of characters. So my question is: which piece
>of code does that translation? So far I have also written some
>heuristics to separate paragraphs; they work most of the time, but not
>always. I think that if I can find a way to translate the other code, I
>will have something that works all the time.


