[cairo] PDF Text Extraction: Past and Present

Behdad Esfahbod behdad at behdad.org
Sat Feb 3 12:50:07 PST 2007


On Sat, 2007-02-03 at 12:43 -0500, Behdad Esfahbod wrote:
> On Sun, 2007-02-04 at 01:35 +1030, Adrian Johnson wrote:
> > Behdad Esfahbod wrote:
> > > To summarize, I suggest that we generate ToUnicode mappings for
> > > all fonts embedded in cairo's PDF output.  This should be done by
> > > calling into the font backends, passing in the scaled-font and an
> > > array of glyph indices, and get back an array of Unicode
> > > character codes.  It helps the backend if input glyphs are sorted
> > > numerically. The PDF backend then will build and add the
> > > ToUnicode CMap.
> > 
> > The attached patch
> >  - Generates ToUnicode mappings for all fonts
> >  - Adds a TrueType/OpenType reverse cmap lookup function.
> >  - Adds FT and Win32 font backend functions for mapping glyphs to
> >    unicode. These backend functions are fallbacks for when the
> >    reverse cmap fails (although for win32 the backend function
> >    only supports Type1 fonts).
> > 
> > Text selection works well in acroread however evince does not
> > correctly select TrueType fonts. This seems to be caused by
> > the individual glyph positioning in the content stream.
> 
> Thanks Adrian! 
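
As a rough illustration of the reverse cmap lookup mentioned in the
quoted patch summary, here is a minimal Python sketch (not cairo code).
It assumes the font's forward cmap is already available as a dict of
code point -> glyph index; the name glyphs_to_unicode and that input
shape are hypothetical.  When several code points share one glyph it
keeps the lowest, which is only one possible policy.

```python
def glyphs_to_unicode(cmap, glyph_indices):
    """Map each glyph index to a Unicode code point, or None if unmapped.

    cmap is a dict of code point -> glyph index (hypothetical input shape).
    """
    reverse = {}
    for codepoint in sorted(cmap):
        # Visit code points in ascending order so the lowest one wins
        # when several code points map to the same glyph.
        reverse.setdefault(cmap[codepoint], codepoint)
    # Glyphs with no cmap entry yield None; a backend fallback could
    # be tried for those.
    return [reverse.get(glyph) for glyph in glyph_indices]
```

For example, glyphs_to_unicode({0x20: 3, 0xA0: 3, 0x41: 36}, [3, 36, 99])
maps glyph 3 to U+0020, glyph 36 to U+0041, and leaves glyph 99 unmapped.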

Hmm.  I gave it a try.  It does not look good in Evince, nor in Adobe
Reader for Linux.  One obvious problem (unrelated to your patch, I
guess) is that the fonts seem to have the wrong ascent/descent.  Anyway,
I am attaching the test case.  I use mixed.pango as markup input to
Pango.  You can reproduce this by building caps and running
"caps mixed.pango > mixed.pdf".


Thanks,

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759


-------------- next part --------------
<span size="12288">????????????
?????????
??????
Roses are Red,
Grass is Green. 2006
Arabic is ????? ????????
??? ???. ????
????. ?????. ?????.
?????? 2006
<span font_desc="Doulos SIL 28">Different &amp; Difficult.</span></span>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: caps.tar.gz
Type: application/x-compressed-tar
Size: 1358 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/cairo/attachments/20070203/a2020e35/caps.tar.bin

