[poppler] On PDF Text Extraction

Wed Sep 19 14:13:00 PDT 2007

On Wed, 2007-09-19 at 16:55 +0200, Martin Schröder wrote:
> 2007/9/19, Behdad Esfahbod <behdad at behdad.org>:
> > Anyway, I wrote about PDF text extraction from the point of view of what
> > cairo should be doing to generate perfectly text-extractable PDFs.
> > Forwarding the message here as people may be interested.  I also point
> > out a few poppler bugs.  I plan to fix them at some point, but it may be
> > an obvious small fix to those familiar with the code base.
> 
> Two things to note, since you are talking about extracting information
> from PDFs you created yourself:
> - tagged PDF can embed more information in the PDF than pure glyphs and may help
> - if tagged PDF is not enough, you can embed even more information
> yourself using private structures

Thanks.  That's not really the goal though.  What I want to do is to
make pango+cairo generate PDFs whose text is extractable in all common
viewers.  Part of that work is to fix bugs in Poppler.

Tagged PDF allows for a lot more information to be stored, but it
doesn't solve the problem of glyph-to-text mapping.

> Best
>    Martin

Regards,
-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759