[poppler] pdf to xml update

Leonard Rosenthol lrosenth at adobe.com
Thu Sep 15 12:50:07 PDT 2011


On 9/15/11 5:43 PM, "Dave" <ldlbad at hotmail.com> wrote:
>  In the file Gfx I read the commands and I have access to the string of
>characters directly from those commands; the text is a parameter of TJ or
>Tj,

That's a recipe for FAILURE!

Most PDF documents in the real world do NOT store readable text in those
operands.  The values in the TJ/Tj are character codes (CIDs, for CID
fonts) into the font!  You MUST use the font & encoding information to
get the correct values.
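
As a rough illustration of why the raw operand bytes are not text, here is
a minimal Python sketch.  It assumes a font with fixed two-byte character
codes (as with an Identity-H encoding) and a code-to-Unicode table of the
kind a ToUnicode CMap provides.  The table contents, the function name
decode_show_string and the sample bytes are all invented for this example;
real fonts can use one-byte or variable-width codes, and simple fonts need
their Encoding entry instead.

```python
# Hypothetical code-to-Unicode table, standing in for what a font's
# ToUnicode CMap would supply. These values are invented for illustration.
TO_UNICODE = {0x0001: "H", 0x0002: "e", 0x0003: "l", 0x0004: "o"}


def decode_show_string(raw: bytes, to_unicode: dict, code_size: int = 2) -> str:
    """Map fixed-width character codes from a Tj/TJ string operand to text."""
    chars = []
    for i in range(0, len(raw) - code_size + 1, code_size):
        # Each code is a big-endian integer of code_size bytes.
        code = int.from_bytes(raw[i:i + code_size], "big")
        # Codes missing from the table become U+FFFD rather than a guess.
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# Bytes as they might appear in a Tj operand: <00010002000300030004>
raw = bytes.fromhex("00010002000300030004")
print(decode_show_string(raw, TO_UNICODE))  # prints "Hello"
```

Printing those same bytes directly gives only unprintable control
characters, which is the kind of garbage you get when the font is ignored.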


>Since all the pieces of text from the same paragraph are always between BT
>(begin text) and ET (end text), I can correctly extract the whole
>paragraph, so I
>don't need to make any guesses or use a more complex process.

Again, that's FAR FROM reality in the majority of PDFs.  I've seen
numerous examples where EACH WORD (or even each letter!) is in its own
BT/ET block.
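
One common way around this is to ignore BT/ET boundaries and rebuild lines
from where each fragment lands on the page.  The sketch below is only an
illustration of that idea, not how poppler does it: the function name
assemble_lines, the (x, y, width, text) tuple layout, the tolerance values
and the sample coordinates are all assumptions made up for the example.

```python
from collections import defaultdict


def assemble_lines(fragments, y_tolerance=2.0, gap=1.0):
    """Group (x, y, width, text) fragments into lines by baseline, left to right."""
    lines = defaultdict(list)
    for x, y, width, text in fragments:
        # Fragments whose baselines fall in the same bucket share a line.
        key = round(y / y_tolerance)
        lines[key].append((x, width, text))

    out = []
    # PDF user space has y increasing upward, so larger y comes first.
    for key in sorted(lines, reverse=True):
        parts, prev_end = [], None
        for x, width, text in sorted(lines[key]):
            # Insert a space only when there is a visible horizontal gap.
            if prev_end is not None and x - prev_end > gap:
                parts.append(" ")
            parts.append(text)
            prev_end = x + width
        out.append("".join(parts))
    return out


# Each tuple stands for text shown in its own BT/ET block.
fragments = [
    (100.0, 700.0, 10.0, "He"),
    (110.0, 700.0, 15.0, "llo"),
    (130.0, 700.0, 25.0, "world"),
    (100.0, 686.0, 20.0, "next"),
]
print(assemble_lines(fragments))  # prints ['Hello world', 'next']
```

Grouping lines into paragraphs needs further heuristics (line spacing,
indentation, columns), which is exactly the guessing that BT/ET cannot
spare you.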


>The problem with this approach
>is that sometimes, instead of letters, I get some weird stuff (it prints like
>a 2x2
>table with numbers),

See the reason above.  It's also a reason you need to get yourself a copy
of ISO 32000-1:2008.


Leonard


