[cairo] PDF Text Extraction: Past and Present

Sat Feb 3 13:07:25 PST 2007

On 03/02/07, Behdad Esfahbod <behdad at behdad.org> wrote:
> On Fri, 2007-02-02 at 21:43 +0000, Baz wrote:
> > BTW one thing missing from your
> > excellent summary was the Zapf table:
> Yeah, I didn't mention Zapf tables because there's no mention of them in
> the PDF reference (as far as I found).  So they are yet another
> non-standard way to do text extraction from PDF.  They are kinda parallel
> to the ToUnicode mechanism.

That wasn't quite what I meant. I'm not suggesting that we generate
Zapf tables in subsetted fonts for PDF to use, but that the Zapf table
is where I'd look in the original font for the glyph->codepoint
mappings for _cairo_truetype_map_glyphs_to_unicode (instead of
reversing cmap). It's pretty irrelevant though: Zapf seems to be unused
in the wild, so Adrian's approach is the right one.
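
To make "reversing cmap" concrete, here is a minimal Python sketch of the
idea. It is not cairo's actual C code: the name reverse_cmap, the toy table,
and the lowest-codepoint tie-break are all assumptions made for illustration.
The point it shows is that cmap goes codepoint->glyph and is not one-to-one,
so inverting it needs some rule for glyphs that several codepoints share.

```python
def reverse_cmap(cmap):
    """Invert a codepoint -> glyph-index mapping into glyph-index -> codepoint.

    Hypothetical helper, not a cairo function. When several codepoints map
    to the same glyph, the lowest codepoint wins (an arbitrary choice made
    for this sketch).
    """
    glyph_to_unicode = {}
    for codepoint, glyph in cmap.items():
        if glyph == 0:
            # Glyph 0 is .notdef; it carries no useful text mapping.
            continue
        previous = glyph_to_unicode.get(glyph)
        if previous is None or codepoint < previous:
            glyph_to_unicode[glyph] = codepoint
    return glyph_to_unicode


# Toy cmap with made-up glyph indices: SPACE and NO-BREAK SPACE share a glyph.
toy_cmap = {0x0020: 3, 0x00A0: 3, 0x0041: 36, 0x0042: 37, 0xFFFF: 0}
glyph_to_unicode = reverse_cmap(toy_cmap)
```

A table that stored codepoints per glyph directly would avoid having to
guess in the shared-glyph case, which is why I'd have looked there first
if fonts actually shipped one.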

-Baz