[poppler] On PDF Text Extraction

Wed Sep 19 07:55:14 PDT 2007

2007/9/19, Behdad Esfahbod <behdad at behdad.org>:
> Anyway, I wrote about PDF text extraction from the point of view of what
> cairo should be doing to generate perfectly text-extractable PDFs.
> Forwarding the message here as people may be interested.  I also point
> out a few poppler bugs.  I plan to fix them at some point, but it may be
> an obvious small fix to those familiar with the code base.

Two things to note, since you are talking about extracting information
from PDFs you created yourself:
- tagged PDF can carry more information in the file than the bare glyphs,
  which may help with extraction
- if tagged PDF is not enough, you can embed even more information
  yourself using private structures
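
To illustrate the first point, here is a minimal sketch of one such
mechanism: a marked-content sequence whose /ActualText property gives the
text that a run of glyphs stands for. The helper names and the glyph code
<0123> are made up for the example, and it only handles ASCII text (other
text would need UTF-16BE encoding with a byte order mark). Whether a given
extractor honours /ActualText is a separate question.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def span_with_actual_text(actual_text: str, show_ops: str) -> str:
    """Wrap text-showing operators in a /Span marked-content sequence.

    actual_text: the ASCII text the glyphs represent (illustrative only).
    show_ops:    content-stream operators that paint the glyphs.
    """
    props = f"<</ActualText ({escape_pdf_string(actual_text)})>>"
    return f"/Span {props} BDC\n{show_ops}\nEMC"


# Hypothetical ligature glyph <0123> that should extract as "fi".
fragment = span_with_actual_text("fi", "<0123> Tj")
print(fragment)
```

The fragment would sit inside a BT ... ET text object in the page's content
stream.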

Best
   Martin