[poppler] On PDF Text Extraction
Behdad Esfahbod
behdad at behdad.org
Wed Sep 19 14:13:00 PDT 2007
On Wed, 2007-09-19 at 16:55 +0200, Martin Schröder wrote:
> 2007/9/19, Behdad Esfahbod <behdad at behdad.org>:
> > Anyway, I wrote about PDF text extraction from the point of view of what
> > cairo should be doing to generate perfectly text-extractable PDFs.
> > Forwarding the message here as people may be interested. I also point
> > out a few poppler bugs. I plan to fix them at some point, but it may be
> > an obvious small fix to those familiar with the code base.
>
> Two things to note, since you are talking about extracting information
> from PDFs you created yourself:
> - tagged PDF can embed more information in the PDF than pure glyphs and may help
> - if tagged PDF is not enough, you can embed even more information
> yourself using private structures
Thanks. That's not really the goal, though. What I want to do is
make pango+cairo generate PDFs whose text is extractable in all common
viewers. Part of that work is fixing bugs in Poppler.
Tagged PDF allows a lot more information to be stored, but it
doesn't solve the problem of glyph-to-text mapping.
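
To illustrate what glyph-to-text mapping means here, a minimal sketch
in Python follows. It assumes a per-font table from glyph ID to Unicode
text, similar in spirit to a PDF ToUnicode CMap. The glyph IDs, the
table contents, and the extract_text helper are all made up for
illustration; this is not how cairo or poppler implement it.

```python
# Hypothetical glyph-ID -> Unicode table, in the spirit of a ToUnicode CMap.
# All glyph IDs below are invented for this example.
TO_UNICODE = {
    36: "o",
    38: "c",
    40: "e",
    301: "ffi",  # a single ligature glyph that stands for three characters
}


def extract_text(glyph_ids, to_unicode):
    """Map each shown glyph to its text; unmapped glyphs become U+FFFD."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


# Glyphs a shaper might emit for the word "office" (assumed IDs).
shown = [36, 301, 38, 40]
text = extract_text(shown, TO_UNICODE)
```

If the ligature entry is missing from the table, the extractor has no
way to recover the three characters behind that glyph, no matter how
much structure information surrounds it.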
> Best
> Martin
Regards,
--
behdad
http://behdad.org/
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759