[poppler] On PDF Text Extraction
Behdad Esfahbod
behdad at behdad.org
Tue Sep 18 16:08:36 PDT 2007
Hi,
Just joined the Poppler list. Excuse my ignorance if I don't know my
way around here yet.
Anyway, I wrote about PDF text extraction from the point of view of what
cairo should be doing to generate perfectly text-extractable PDFs.
Forwarding the message here as people may be interested. I also point
out a few poppler bugs. I plan to fix them at some point, but they may
be obvious small fixes to those familiar with the code base.
Regards,
behdad
-------- Forwarded Message --------
From: Behdad Esfahbod <behdad at behdad.org>
To: cairo <cairo at cairographics.org>
Subject: PDF Text Extraction: Future
Date: Sun, 16 Sep 2007 19:37:53 -0400
Previously I wrote about the state of text extraction in PDF
generated by cairo. This is a follow-up message and uses
definitions from the first one. So, please review it before
reading on:
http://lists.freedesktop.org/archives/cairo/2007-February/009452.html
Warning: long long email. Refill your cup of coffee now.
=================================================================
Text Extraction, Future:
As already mentioned, perfect text extraction is not possible
when only the output glyphs are given to the PDF generator. So
we need a new API in cairo for that. This new API needs to take
both input text and output glyphs, and a mapping between the two.
To get into the mapping, we need new terminology:
- Cluster: A minimal pair of matching input text and output
glyphs. Being minimal, a cluster cannot be semantically
broken into two clusters. Most of the time clusters can be
grouped into one of the following kinds, though this does not
necessarily hold:
* Grapheme clusters. A grapheme is a piece of input text /
output glyphs that expresses one logical unit of text as
identified by the language of the text in question.
For example, "ö" is a grapheme. The input text
representing it may be U+00F6 LATIN SMALL LETTER O WITH
DIAERESIS, or <U+006F LATIN SMALL LETTER O, U+0308 COMBINING
DIAERESIS>, and the output glyphs may be a single glyph, or
two (one for "o", another for the diaeresis for example).
In grapheme clusters, the input text is associated with the
output glyphs as a whole, and it is not desirable to
extract part of the input text if only part of the output
glyphs are selected.
* Ligature clusters. A ligature cluster includes input text
for more than one grapheme, but the graphemes interact
(typographically or orthographically) such that the output
glyphs cannot be divided further for individual graphemes.
An example of a ligature cluster that is purely
typographical is the "ff" ligature. It maps two input
graphemes <"f","f"> to a single glyph "ff".
An orthographical ligature cluster is the Arabic Lam-Alef
ligature that maps two input graphemes <"ل","ا"> to
something that looks like "ﻻ". This is a mandatory
ligature in Unicode, and its shape is different from what
Lam followed by Alef would otherwise look like.
A ligature cluster most of the time has only one output
glyph, but it doesn't have to.
Since ligature clusters contain more than one grapheme, it
is perfectly valid for the user to want to select only
a subset of the graphemes. There is no general solution to
this problem, so most implementations simply divide the
width of an output glyph into the number of graphemes
linearly. So for example if you mouse over to the middle
of the "ff" ligature, only one "f" will be selected...
This is far from perfect for some Indic languages, but is
good enough.
To sum up, a cluster is an indivisible mapping of M input
characters to N output glyphs. We show this as M->N. Most
clusters are 1->1, but 1->N, M->1, and M->N are all commonly
found in more complex text.
There are also special cases where one of M or N is zero. An
M->0 cluster may be an ASCII tab character, or a U+200C ZERO
WIDTH NON-JOINER. We will find a use for 0->N later :-).
And a cluster mapping:
- Cluster Mapping: A sequence of clusters, each encoded as the
number of characters and the number of glyphs that it
consumes. The mapping may also have a "backward" flag set on
it, in which case the output glyphs will be consumed from the
end of the array to the beginning. This is needed in
right-to-left runs of text for example.
So, our new cairo API takes input text, output glyphs, and a
cluster mapping. No _extents or _path variants are needed. I
have prototyped this call in the show_text_glyphs branch in my
tree. Here is the prototype:
    typedef struct {
        unsigned char num_bytes;
        unsigned char num_glyphs;
    } cairo_text_cluster_t;

    cairo_public void
    cairo_show_text_glyphs (cairo_t                    *cr,
                            const char                 *utf8,
                            int                         utf8_len,
                            const cairo_glyph_t        *glyphs,
                            int                         num_glyphs,
                            const cairo_text_cluster_t *clusters,
                            int                         num_clusters,
                            cairo_bool_t                backward);
Yes, show_text_glyphs is not the best name in the world, but is
much better than show_clusters, show_text_clusters, etc...
The implementation tries the show_text_glyphs method of the
target surface and if that doesn't work, falls back to
show_glyphs.
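To make the mapping concrete, here is what a call could look like
for the word "office" when the font renders "ffi" as a single
ligature glyph. This is just a sketch against the prototype above;
the glyph indices and positions are made up for illustration, and
cr is assumed to be an existing cairo context:

    /* Hypothetical example: "office" drawn with four glyphs,
     * where the "ffi" ligature forms a single 3->1 cluster.
     * Glyph indices and positions are made up. */
    cairo_glyph_t glyphs[] = {
        {  82,  0.0, 0.0 },  /* "o"            */
        { 602, 10.0, 0.0 },  /* "ffi" ligature */
        {  70, 25.0, 0.0 },  /* "c"            */
        {  72, 34.0, 0.0 },  /* "e"            */
    };
    cairo_text_cluster_t clusters[] = {
        { 1, 1 },  /* "o"   -> 1 glyph        */
        { 3, 1 },  /* "ffi" -> 1 glyph (3->1) */
        { 1, 1 },  /* "c"   -> 1 glyph        */
        { 1, 1 },  /* "e"   -> 1 glyph        */
    };

    cairo_show_text_glyphs (cr, "office", 6,
                            glyphs, 4,
                            clusters, 4,
                            0 /* not backward */);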
It is important to get the API right for 1.6, such that Pango can
move to using it. The actual PDF implementation can be fixed
gradually and later.
Note that PDF is not the only backend that can make good use of
this call. SVG can do similar stuff, and may have an option to
throw the glyphs away completely and just embed the text.
=================================================================
PDF Implementation:
Before I started the research that led to this thread, I wrote some
stuff about this, which I now see does not work. Specifically,
ActualText is not supported in poppler (and possibly other
extractors) at all, so that cannot be part of a portable
solution. Here is the old post for reference:
http://lists.freedesktop.org/archives/cairo/2007-January/009127.html
I now have a solid plan that requires changes in cairo, pango,
and poppler, but all the changes are gradual and small, and it
should work reasonably well with other viewers/extractors too,
though I have not tested them yet.
The basic idea is that instead of allocating one code-point per
unique glyph, we will now allocate one code-point per unique
M->1 cluster. The reason is that, as I summarized in the first mail,
each code point maps to exactly one glyph and a string of text,
using the ToUnicode mapping. This seemingly obvious idea solves
a lot of problems. First, we don't have to care about what the
default character for a glyph is. We just don't care!
So here is a first pass on the algorithm:
for each cluster:
- if it's M->0, add the index of an "empty" glyph to the
  cluster's output glyphs and act as if it's M->1.
- if it's M->1, search for this cluster among the current
  code-points, and allocate a new code-point if it's not
  found, mapping to the single glyph involved and the input
  text (using ToUnicode).
- otherwise it's M->N where N > 1. Break the cluster into N
  pseudo-clusters, each having exactly one glyph and 0 or
  more input characters. The break can be done linearly, or
  taking glyph advances or positions into account. For each
  resulting pseudo-cluster, act as if it's an M->1 cluster.
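For concreteness, here is a rough C sketch of those three cases.
Everything in it is hypothetical: subset_t, glyph_t, the helpers
subset_lookup(), subset_allocate(), emit_code_point(), and
EMPTY_GLYPH_INDEX stand in for code that would live in cairo's PDF
surface and font subsetter:

    /* All types and helpers here are hypothetical stand-ins. */
    static void
    map_cluster (subset_t *subset,
                 const char *utf8, int num_bytes,
                 const glyph_t *glyphs, int num_glyphs)
    {
        if (num_glyphs == 0) {
            /* M->0: substitute the font's dummy "empty" glyph
             * and treat the cluster as M->1. */
            glyph_t empty = { EMPTY_GLYPH_INDEX };
            map_cluster (subset, utf8, num_bytes, &empty, 1);
        } else if (num_glyphs == 1) {
            /* M->1: one code-point per unique (glyph, text)
             * pair; the text becomes its ToUnicode entry. */
            int code = subset_lookup (subset, glyphs[0].index,
                                      utf8, num_bytes);
            if (code < 0)
                code = subset_allocate (subset, glyphs[0].index,
                                        utf8, num_bytes);
            emit_code_point (subset, code);
        } else {
            /* M->N, N > 1: break into N one-glyph
             * pseudo-clusters.  This splits the bytes linearly;
             * real code must at least split at UTF-8 character
             * boundaries. */
            int i;
            for (i = 0; i < num_glyphs; i++) {
                int start = num_bytes * i / num_glyphs;
                int end   = num_bytes * (i + 1) / num_glyphs;
                map_cluster (subset, utf8 + start, end - start,
                             &glyphs[i], 1);
            }
        }
    }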
Discussion:
- It's crucial for the above algorithm to work that a ToUnicode
entry mapping a glyph to an empty string works. That is, a
glyph that maps to zero Unicode characters. Nowhere in the
CMap spec could I find whether this is allowed or not. I don't
remember if Acrobat handles it correctly, but poppler has
some bugs: when you select such a character, it's not
rendered correctly and it outputs a U+FFFD in the extracted
text instead of nothing.
A nasty hack to relax this is to output an "ignorable"
Unicode character instead. Something like U+2060 WORD
JOINER, or if that affects Indic shaping, a nastier one like
U+2063 INVISIBLE SEPARATOR, or something equally useless and
ignorable (general category Cf).
Another way around it, if many viewers don't support it, is to
add new composite glyphs to the font that combine multiple
glyphs into one, so we don't have to use 0->N clusters. Some
font formats may not support composite glyphs however.
- Every font may need to have an "empty" glyph, which is most
useful if zero-width. This is to be able to include things
like U+200C ZERO WIDTH NON-JOINER in the extracted text. As
these normally don't map to any glyphs, and every code-point
maps to exactly one glyph, we need a dummy glyph to be able
to preserve such invisible characters. This also includes things
like ASCII tab character. It should be trivial to add such a
glyph in all font formats.
- When selecting an M->1 cluster, poppler divides a glyph's
width linearly into M parts, one for each character. This is
similar to what Pango does, except that Pango only allows
selection at grapheme boundaries, not at each character
boundary. There does not seem to be any way to convey this
information in PDF. But viewers can overcome this rather
easily: the grapheme boundary (aka cursor-position)
information is font-independent and only depends on the
input text (plus the Unicode Character Database). So,
Poppler (and other viewers) can be made to call into Pango
or similar systems to find out grapheme boundaries and allow
selection only at those points, as sketched below.
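A minimal sketch of that idea follows. pango_get_log_attrs() and
PangoLogAttr are real Pango API; the wrapper function is
illustrative, and a real viewer would cache the attrs per text run
and use the text's detected language rather than the default one:

    #include <pango/pango.h>

    /* Hypothetical helper: whether selection may start or end
     * before the char_index'th character of text. */
    static gboolean
    is_grapheme_boundary (const char *text, int len, int char_index)
    {
        int n_chars = (int) g_utf8_strlen (text, len);
        PangoLogAttr *attrs = g_new (PangoLogAttr, n_chars + 1);
        gboolean result;

        pango_get_log_attrs (text, len,
                             -1 /* embedding level unknown */,
                             pango_language_get_default (),
                             attrs, n_chars + 1);
        result = attrs[char_index].is_cursor_position;

        g_free (attrs);
        return result;
    }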
- Backward: the backward mapping flag is needed as part of the
cluster mapping, but it may also be used to ensure that
extracted bidi text behaves the way it was intended.
The main problem is that PDF doesn't have an easy way to
convey bidi information. There is a ReversedText property
but it belongs to the Tagged PDF part of the spec, which is
far from supported. We will assume that glyphs in a PDF are
always laid out in visual order, that is, from left to
right. There is nothing preventing a library from generating
glyphs that have a negative advance width and so go in the
logical order for right-to-left text, but it's not common
practice and most probably not very well supported. So we'll
stay with LTR glyphs for now.
The problem then is how a viewer like Poppler should do the
reverse-bidi step to get to the logical text from the visual
text extracted from the glyphs. Poppler currently uses a
heuristic based on the bidi direction of the characters
extracted, and while it does a good job at it, let's just
say that it doesn't do it very correctly. What it does is
find maximal runs of RTL chars, reverse the text extracted
for each run, and sandwich it between a U+202B RIGHT-TO-LEFT
EMBEDDING and a U+202C POP DIRECTIONAL FORMATTING pair.
Obvious problems with this are:
o The hard thing about reverse-bidi is how to handle
"neutral" characters: those without a strong LTR or RTL
direction. Poppler doesn't try to group them with
adjacent RTL runs.
o With a run of all-RTL chars, there's absolutely no need
to put RLE/PDF around it.
o Instead of reversing the glyphs and then extracting and
appending text from them, it extracts the text first and
then reverses it. So if a glyph maps to two or more
characters, those come out backwards in the extracted
text, which is wrong.
We can do better by making some assumptions about the
generated PDF, and working a bit harder. The assumption to
make about the PDF is that each text show operation contains
a run of text with exactly one direction. That is, assume
that the text layout engine broke text into runs based on
direction among other things. This holds true for virtually
all text layout engines I know.
Another way to improve the situation is to break text runs
into hopefully less ambiguous ones in Pango, to make the job
of the extractor easier. This can be done, for example, when
both LTR and RTL characters are present in a text run.
Then the text extractor does this (the neutral-resolution
step is sketched after this list):
- For every line, extract text from all segments, and mark
each segment as LTR, RTL, or neutral based on the type of
chars found in it.
- Group adjacent runs of the same type together.
- Resolve neutral groups to the neighboring groups if both
are of the same type, otherwise resolve to the "base
direction" of the line (to be devised).
- Do some special-casing for digits, etc.
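A simplified sketch of the neutral-resolution step, with
hypothetical types; a real implementation would classify
characters using the Unicode Character Database (e.g. through
FriBidi):

    typedef enum { DIR_LTR, DIR_RTL, DIR_NEUTRAL } run_dir_t;

    /* Resolve each group of neutral runs to its neighbors'
     * direction if they agree, else to the base direction. */
    static void
    resolve_neutrals (run_dir_t *dirs, int n, run_dir_t base_dir)
    {
        int i;
        for (i = 0; i < n; i++) {
            int j;
            run_dir_t before, after, resolved;

            if (dirs[i] != DIR_NEUTRAL)
                continue;
            /* find the end of this neutral group */
            j = i;
            while (j < n && dirs[j] == DIR_NEUTRAL)
                j++;
            before = i > 0 ? dirs[i - 1] : base_dir;
            after  = j < n ? dirs[j]     : base_dir;
            resolved = (before == after) ? before : base_dir;
            while (i < j)
                dirs[i++] = resolved;
        }
    }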
This is really about writing a sophisticated reverse-bidi
algorithm, something that I've wanted to do for such a long
time, and I may as well tackle it this time. Lacking that, I
have observed before that a good approximation to a
reverse-bidi algorithm is the bidi algorithm itself. So one
may feel lucky and use pango_log2vis_get_embedding_levels()
directly (or the FriBidi equivalent). It's mostly about
getting the details right, and I can't say more before
getting down to code. One complication is that the text
stream may itself contain bidi override marks, as we now
can, and want to, embed them in the text stream.
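As a sketch of that feeling-lucky approach:
pango_log2vis_get_embedding_levels() is real Pango API and returns
one embedding level per character; everything else here, including
running it on visually-ordered text as an approximation, is
illustrative:

    #include <pango/pango.h>

    static void
    approximate_reverse_bidi (const char *visual_text, int len)
    {
        PangoDirection base_dir = PANGO_DIRECTION_NEUTRAL;
        guint8 *levels;

        levels = pango_log2vis_get_embedding_levels (visual_text,
                                                     len,
                                                     &base_dir);

        /* ... walk `levels': maximal runs with odd levels are
         * right-to-left and would need their extracted
         * characters reversed to recover logical order ... */

        g_free (levels);
    }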
- Other poppler issues I faced:
o When a glyph maps to more than one character and part of
the glyph is selected, it renders the glyph multiple times.
This should be obvious to fix.
o When a glyph maps to zero chars, it doesn't render the
glyph in the selection and copies a space character instead.
More importantly, we should see how Adobe Reader handles it.
o It has serious problems copying combining marks positioned
using the Td operator.
o It does a hefty job at column detection, which is good, but
has serious problems. What I understand from reading the
code is that it sorts all the glyphs in visual order (top
to bottom, left to right) and processes them in that order.
It should instead work at the text-operation level, at the
least. My guess is that simply processing text operations
in their running order in the PDF file produces much better
results, but then again there was probably good reason to
write the current code in the first place.
- Thinking about implementation, ToUnicode tables are
independent of font formats, so one piece of code can
generate them nicely, which is mostly the case already. But
the codepoint-to-glyph mapping lives in different places for
different font types. So the code to handle it is spread
over both cairo-*-subset.c files and cairo-*-surface.c files.
But given all the PDF expertise we have thanks to Adrian's
awesome work, I wouldn't worry much about implementation :-).
- Pango also needs a new method in PangoRenderer called
draw_item, which is the Pango equivalent of show_text_glyphs.
That's easy. A harder issue is making Pango render the runs
in a line in logical order, as opposed to the current visual
order. That's a bit harder because of the way the
PangoLayoutLine struct has been made public and can't be
changed. With runs rendered in logical order, viewers that
simply follow PDF order work best, and those sorting
runs/glyphs based on their position will not work any worse
than they currently do.
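For concreteness, here is one hypothetical shape the new vfunc
could take; this is not real Pango API, just an assumption about
what it might look like. A PangoItem describes the input text
range and its analysis, and a PangoGlyphString the output glyphs,
which together is exactly the text/glyphs/clusters information
that cairo_show_text_glyphs() wants:

    #include <pango/pango.h>

    /* Hypothetical signature for the proposed vfunc. */
    typedef void (*PangoRendererDrawItemFunc)
                     (PangoRenderer    *renderer,
                      const char       *text,
                      PangoItem        *item,
                      PangoGlyphString *glyphs,
                      int               x,
                      int               y);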
=================================================================
Other Issues:
Some other random thoughts about PDF text output that are not
necessarily related to the text extraction problem, but that
passed through my mind during these experiments:
- A problem with using composite fonts is that by the time you
find out that you need a composite font (that is, more than
255 glyphs of the font should be subsetted), it's too late
to switch, since you have already output PDF code for the
previous glyphs as single-byte codepoints. So one ends up
using composite fonts unconditionally (the exception is when
the original font has fewer than 256 glyphs, in which case
there's no point in using composite fonts at all). This
slightly wastes space, as each codepoint will be encoded as
four hex bytes instead of two. This assumes the simple UCS-2
Identity CMap codepoint encoding that maps two bytes to a
glyph value of 0..65535. However, one can use a UTF-8-like
scheme such that the first 128 glyphs are accessed using one
byte and so on (sketched below). This way the PDF generator
can use a simple font for subsets with fewer than 128
glyphs. However, Adrian tells me that Poppler only supports
the UCS-2 Identity CMap mapping for CID font codepoint
encoding, so this may not be feasible. It does support
arbitrary CMap mappings for ToUnicode tables though, so the
code certainly is there... The PDF spec also says that:
Acrobat Versions 4.0 and 5.0 (PDF Versions 1.3 and 1.4,
respectively) must use “ToUnicode” mapping files that are
restricted to UCS-2 (Big Endian) encoding, which is
equivalent to UTF-16BE encoding without Surrogates.
(Note how the "fi" in "files" is encoded as U+FB01 LATIN
SMALL LIGATURE FI because I copy/pasted from the spec.)
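Here is a sketch of what such a UTF-8-like scheme could look like.
This is an idea for an encoding, not an existing PDF CMap; a real
implementation would also need to emit the matching prefix-free
codespace ranges (e.g. <00> <7F> for one byte, <8000> <FF7F> for
two):

    /* Sketch: glyph ids below 128 take one byte, ids up to
     * 16383 take two, so a subset of fewer than 128 glyphs
     * costs two hex digits per glyph in the content stream,
     * like a simple font. */
    static int
    encode_glyph_code (unsigned int glyph_id, unsigned char buf[2])
    {
        if (glyph_id < 0x80) {
            buf[0] = (unsigned char) glyph_id;   /* 0xxxxxxx */
            return 1;
        } else {
            /* lead byte has the high bit set, so one- and
             * two-byte codes never collide */
            buf[0] = (unsigned char) (0x80 | (glyph_id >> 7));
            buf[1] = (unsigned char) (glyph_id & 0x7f);
            return 2;
        }
    }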
- Shall we use standard encodings if all the used glyphs in a
subset are in a well-supported standard encoding? It may be
worth the slight optimization. It may also make the
generated PS/PDF more readable in the case of simple ASCII
text.
- It also occurred to me that in PDF almost all objects can
come after they are referenced. Does this mean we can write
out pages as we go and avoid the writing to a temp file that
we currently do?
- Tagged PDF allows for a lot more, but it's very hard to, for
example, mark all paragraphs and pages and such. There's
also a ReversedText tag in there. In most cases though, it
seems like we can do the same without it as far as text
extraction is concerned. Some cairo API may be added to
allow Tagged PDF marking from a higher level. Something
like:

    cairo_pdf_marked_content_sequence_t
    cairo_pdf_surface_begin/end_marked_content()
- Other possibly useful PDF APIs would be for embedding
arbitrary objects, for marking ActualText around arbitrary
drawing operations, and for getting the object number of any
embedded fonts, patterns, etc. We are still waiting for
someone more familiar with PDF and higher-level use-cases to
tell us what exactly they need from cairo...
=================================================================
Anyway, wow, 400 lines. Thanks for reading this far. I'm going
to do a presentation on PDF text extraction at the Linux
Foundation OpenPrinting Summit next week in Montréal, mostly based on
this mail:
http://www.linux-foundation.org/en/OpenPrinting/SummitMontreal
I'll try to hack something up this week, but most probably I'll
just get to do the Pango side and wait for some PDF ninja to do the
cairo side. You know who you are!
Cheers,
--
behdad
http://behdad.org/
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759