[poppler] On PDF Text Extraction
Martin Schröder
martin at oneiros.de
Wed Sep 19 07:55:14 PDT 2007
2007/9/19, Behdad Esfahbod <behdad at behdad.org>:
> Anyway, I wrote about PDF text extraction from the point of view of what
> cairo should be doing to generate perfectly text-extractable PDFs.
> Forwarding the message here as people may be interested. I also point
> out a few poppler bugs. I plan to fix them at some point, but it may be
> an obvious small fix to those familiar with the code base.
Two things to note, since you are talking about extracting information
from PDFs you created yourself:
- Tagged PDF can embed more information in the PDF than the glyphs
alone, which may help with extraction.
- If tagged PDF is not enough, you can embed even more information
yourself using private structures.
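
As one illustration of the first point (an assumption on my part, since the
message does not name a specific mechanism): PDF marked content lets a
producer attach an /ActualText entry to a run of glyphs, so an extractor can
recover the intended characters even when the glyphs themselves do not map
cleanly to text, as with a ligature. The sketch below only builds the
content-stream fragment as a string; the helper names and the glyph ID are
hypothetical, and this is not how cairo or poppler implement it.

```python
def pdf_text_string(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE with a leading byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def actual_text_span(actual_text: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a Span marked-content sequence
    (BDC ... EMC) whose property dictionary carries /ActualText."""
    props = f"<< /ActualText {pdf_text_string(actual_text)} >>"
    return f"/Span {props} BDC\n{show_ops}\nEMC"


# Hypothetical example: one ligature glyph (the glyph ID 0123 is made up)
# that should be extracted as the three characters "ffi".
fragment = actual_text_span("ffi", "<0123> Tj")
print(fragment)
```

A producer would emit such a fragment inside a page content stream between
BT and ET; an extractor that honors /ActualText would then report "ffi" for
that span instead of guessing from the glyph.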
Best
Martin