[poppler] On PDF Text Extraction

Martin Schröder martin at oneiros.de
Wed Sep 19 07:55:14 PDT 2007


2007/9/19, Behdad Esfahbod <behdad at behdad.org>:
> Anyway, I wrote about PDF text extraction from the point of view of what
> cairo should be doing to generate perfectly text-extractable PDFs.
> Forwarding the message here as people may be interested.  I also point
> out a few poppler bugs.  I plan to fix them at some point, but it may be
> an obvious small fix to those familiar with the code base.

Two things to note, since you are talking about extracting information
from PDFs you created yourself:
- Tagged PDF can embed more information in the PDF than the bare glyphs
  (for example, logical structure and reading order), and may help.
- If tagged PDF is not enough, you can embed even more information
  yourself using private structures.
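
As a rough illustration of carrying more than glyphs (a sketch only, not
code from cairo or poppler): the PDF specification lets a marked-content
sequence carry an /ActualText entry that states the text a run of glyphs
stands for. The helper name actual_text_span and the sample glyph string
<0123> are invented for this example, and a real generator would also
have to write the rest of the content stream and, for tagged PDF, the
structure tree.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap text-showing operators in a /Span marked-content sequence.

    `text` is the Unicode text the glyphs represent; `show_ops` is the
    already-built content-stream fragment that paints them (assumed valid).
    """
    # PDF text strings may be UTF-16BE with a leading byte order mark;
    # writing them as a hex string avoids escaping parentheses.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


if __name__ == "__main__":
    # Hypothetical case: a single ligature glyph (code 0123) meaning "fi".
    print(actual_text_span("fi", "<0123> Tj"))
```

An extractor that honours /ActualText can then report "fi" for that span
instead of guessing from the glyph code.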

Best
   Martin

