[poppler] On PDF Text Extraction
Behdad Esfahbod
behdad at behdad.org
Tue Sep 18 16:08:36 PDT 2007
Hi,
Just joined the Poppler list. Excuse my ignorance if I don't know my
way around here yet.
Anyway, I wrote about PDF text extraction from the point of view of what
cairo should be doing to generate perfectly text-extractable PDFs.
Forwarding the message here as people may be interested. I also point
out a few poppler bugs. I plan to fix them at some point, but they may
be obvious small fixes to those familiar with the code base.
Regards,
behdad
-------- Forwarded Message --------
From: Behdad Esfahbod <behdad at behdad.org>
To: cairo <cairo at cairographics.org>
Subject: PDF Text Extraction: Future
Date: Sun, 16 Sep 2007 19:37:53 -0400
Previously I wrote about the state of text extraction in PDF
generated by cairo. This is a follow-up message and uses
definitions from the first one. So, please review it before
reading on:
http://lists.freedesktop.org/archives/cairo/2007-February/009452.html
Warning: long long email. Refill your cup of coffee now.
=================================================================
Text Extraction, Future:
As already mentioned, perfect text extraction is not possible
when only the output glyphs are given to the PDF generator. So
we need a new API in cairo for that. This new API needs to take
both input text and output glyphs, and a mapping between the two.
To get into the mapping, we need new terminology:
- Cluster: A minimal pair of matching input text and output
glyphs. Being minimal, a cluster cannot be semantically
broken into two clusters. Most of the time clusters can be
grouped into one of the following kinds, though this does not
necessarily hold:
* Grapheme clusters. A grapheme is a piece of input text /
output glyphs that expresses one logical unit of text as
identified by the language of the text in question.
For example, "ö" is a grapheme. The input text
representing it may be U+00F6 LATIN SMALL LETTER O WITH
DIAERESIS, or <U+006F LATIN SMALL LETTER O, U+0308 COMBINING
DIAERESIS>, and the output glyphs may be a single glyph, or
two (one for "o", another for the diaeresis for example).
In grapheme clusters, the input text is associated with the
output glyphs as a whole, and it is not desirable to
extract part of the input text if only part of the output
glyphs are selected.
* Ligature clusters. A ligature cluster includes input text
for more than one grapheme, but the graphemes interact
(typographically or orthographically) such that the output
glyphs cannot be divided further for individual graphemes.
An example of a ligature cluster that is purely
typographical is the "ff" ligature. It maps two input
graphemes <"f","f"> to a single glyph "ff".
An orthographical ligature cluster is the Arabic Lam-Alef
ligature that maps two input graphemes <"ل","ا"> to
something that looks like "ﻻ". This is a mandatory
ligature in Unicode, and its shape is different from what
Lam followed by Alef would otherwise look like.
A ligature cluster most of the time has only one output
glyph, but it doesn't have to.
Since ligature clusters contain more than one grapheme, it
is perfectly valid for the user to want to select only
a subset of the graphemes. There is no general solution to
this problem, so most implementations simply divide the
width of an output glyph into the number of graphemes
linearly. So for example if you mouse over to the middle
of the "ff" ligature, only one "f" will be selected...
This is far from perfect for some Indic languages, but is
good enough.
To sum up, a cluster is an indivisible mapping of M input
characters to N output glyphs. We show this as M->N. Most
clusters are 1->1, but 1->N, M->1, and M->N are all commonly
found in more complex text.
There are also special cases where one of M or N is zero. An
M->0 cluster may be an ASCII tab character, or a U+200C ZERO
WIDTH NON-JOINER. We will find a use for 0->N later :-).
And a cluster mapping:
- Cluster Mapping: A sequence of clusters, each encoded as the
number of characters and the number of glyphs that it
consumes. The mapping may also have a "backward" flag set on
it, in which case the output glyphs will be consumed from the
end of the array to the beginning. This is needed in
right-to-left runs of text for example.
So, our new cairo API takes input text, output glyphs, and a
cluster mapping. No _extents or _path variants are needed. I
have prototyped this call in the show_text_glyphs branch in my
tree. Here is the prototype:
    typedef struct {
        unsigned char num_bytes;
        unsigned char num_glyphs;
    } cairo_text_cluster_t;

    cairo_public void
    cairo_show_text_glyphs (cairo_t                    *cr,
                            const char                 *utf8,
                            int                         utf8_len,
                            const cairo_glyph_t        *glyphs,
                            int                         num_glyphs,
                            const cairo_text_cluster_t *clusters,
                            int                         num_clusters,
                            cairo_bool_t                backward);
Yes, show_text_glyphs is not the best name in the world, but is
much better than show_clusters, show_text_clusters, etc...
The implementation tries the show_text_glyphs method of the
target surface and if that doesn't work, falls back to
show_glyphs.
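To make the mapping concrete, here is what a call could look like
for the word "office" when the font renders "ffi" as a single
ligature glyph. This is just a sketch against the prototype above;
the glyph indices and positions are made up for illustration, and
cr is assumed to be an existing cairo context:

    /* Hypothetical example: "office" drawn with four glyphs,
     * where the "ffi" ligature forms a single 3->1 cluster.
     * Glyph indices and positions are made up. */
    cairo_glyph_t glyphs[] = {
        {  82,  0.0, 0.0 },  /* "o"            */
        { 602, 10.0, 0.0 },  /* "ffi" ligature */
        {  70, 25.0, 0.0 },  /* "c"            */
        {  72, 34.0, 0.0 },  /* "e"            */
    };
    cairo_text_cluster_t clusters[] = {
        { 1, 1 },  /* "o"   -> 1 glyph        */
        { 3, 1 },  /* "ffi" -> 1 glyph (3->1) */
        { 1, 1 },  /* "c"   -> 1 glyph        */
        { 1, 1 },  /* "e"   -> 1 glyph        */
    };

    cairo_show_text_glyphs (cr, "office", 6,
                            glyphs, 4,
                            clusters, 4,
                            0 /* not backward */);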
It is important to get the API right for 1.6, such that Pango can
move to using it. The actual PDF implementation can be fixed
gradually and later.
Note that PDF is not the only backend that can make good use of
this call. SVG can do similar stuff, and may have an option to
throw the glyphs away completely and just embed the text.
=================================================================
PDF Implementation:
Before I started the research that led to this thread, I wrote some
stuff about this, which I now see does not work. Specifically,
ActualText is not supported in poppler (and possibly other
extractors) at all, so that cannot be part of a portable
solution. Here is the old post for reference:
http://lists.freedesktop.org/archives/cairo/2007-January/009127.html
I now have a solid plan that requires changes in cairo, pango,
and poppler, but all the changes are gradual and small, and it
should work reasonably well with other viewers/extractors too,
though I have not tested them yet.
The basic idea is that instead of allocating one code-point per
unique glyph, we will now allocate one code-point per unique
M->1 cluster. The reason is that, as I summarized in the first mail,
each code point maps to exactly one glyph and a string of text,
using the ToUnicode mapping. This seemingly obvious idea solves
a lot of problems. First, we don't have to care about what the
default character for a glyph is. We just don't care!
So here is a first pass on the algorithm:
for each cluster:
- if it's M->0, add the index of an "empty" glyph to the
  cluster's output glyphs and act as if it's M->1.
- if it's M->1, search for this cluster among the current
  code-points, and allocate a new code-point if it's not
  found, mapping to the single glyph involved and the input
  text (using ToUnicode).
- otherwise it's M->N where N > 1. Break the cluster into N
  pseudo-clusters, each having exactly one glyph and 0 or
  more input characters. The break can be done linearly, or
  taking glyph advances or positions into account. For each
  resulting pseudo-cluster, act as if it's an M->1 cluster.
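For concreteness, here is a rough C sketch of those three cases.
Everything in it is hypothetical: subset_t, glyph_t, the helpers
subset_lookup(), subset_allocate(), emit_code_point(), and
EMPTY_GLYPH_INDEX stand in for code that would live in cairo's PDF
surface and font subsetter:

    /* All types and helpers here are hypothetical stand-ins. */
    static void
    map_cluster (subset_t *subset,
                 const char *utf8, int num_bytes,
                 const glyph_t *glyphs, int num_glyphs)
    {
        if (num_glyphs == 0) {
            /* M->0: substitute the font's dummy "empty" glyph
             * and treat the cluster as M->1. */
            glyph_t empty = { EMPTY_GLYPH_INDEX };
            map_cluster (subset, utf8, num_bytes, &empty, 1);
        } else if (num_glyphs == 1) {
            /* M->1: one code-point per unique (glyph, text)
             * pair; the text becomes its ToUnicode entry. */
            int code = subset_lookup (subset, glyphs[0].index,
                                      utf8, num_bytes);
            if (code < 0)
                code = subset_allocate (subset, glyphs[0].index,
                                        utf8, num_bytes);
            emit_code_point (subset, code);
        } else {
            /* M->N, N > 1: break into N one-glyph
             * pseudo-clusters.  This splits the bytes linearly;
             * real code must at least split at UTF-8 character
             * boundaries. */
            int i;
            for (i = 0; i < num_glyphs; i++) {
                int start = num_bytes * i / num_glyphs;
                int end   = num_bytes * (i + 1) / num_glyphs;
                map_cluster (subset, utf8 + start, end - start,
                             &glyphs[i], 1);
            }
        }
    }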
Discussion:
- It's crucial for the above algorithm to work that a ToUnicode
entry mapping a glyph to an empty string works. That is, a
glyph that maps to zero Unicode characters. Nowhere in the
CMap spec could I find whether this is allowed or not. I don't
remember if Acrobat handles it correctly, but poppler has
some bugs: when you select such a character, it's not
rendered correctly and it outputs a U+FFFD in the extracted
text instead of nothing.
A nasty hack to relax this is to output an "ignorable"
Unicode character instead. Something like U+2060 WORD
JOINER, or if that affects Indic shaping, a nastier one like
U+2063 INVISIBLE SEPARATOR, or something equally useless and
ignorable (general category Cf).
Another way around it, if many viewers don't support it, is to
add new composite glyphs to the font that combine multiple
glyphs into one, so we don't have to use 0->N clusters. Some
font formats may not support composite glyphs however.
- Every font may need to have an "empty" glyph, which is most
useful if zero-width. This is to be able to include things
like U+200C ZERO WIDTH NON-JOINER in the extracted text. As
these normally don't map to any glyphs, and every code-point
maps to exactly one glyph, we need a dummy glyph to be able
to preserve such invisible characters. This also includes things
like ASCII tab character. It should be trivial to add such a
glyph in all font formats.
- When selecting an M->1 cluster, poppler divides a glyph's
width linearly into M parts, one for each character. This is
similar to what Pango does, except that Pango only allows
selection at grapheme boundaries, not at each character
boundary. There does not seem to be any way to convey this
information in PDF. But viewers can overcome this rather
easily: the grapheme boundary (aka cursor-position)
information is font-independent and only depends on the
input text (plus the Unicode Character Database). So,
Poppler (and other viewers) can be made to call into Pango
or similar systems to find out grapheme boundaries and allow
selection only at those points, as sketched below.
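A minimal sketch of that idea follows. pango_get_log_attrs() and
PangoLogAttr are real Pango API; the wrapper function is
illustrative, and a real viewer would cache the attrs per text run
and use the text's detected language rather than the default one:

    #include <pango/pango.h>

    /* Hypothetical helper: whether selection may start or end
     * before the char_index'th character of text. */
    static gboolean
    is_grapheme_boundary (const char *text, int len, int char_index)
    {
        int n_chars = (int) g_utf8_strlen (text, len);
        PangoLogAttr *attrs = g_new (PangoLogAttr, n_chars + 1);
        gboolean result;

        pango_get_log_attrs (text, len,
                             -1 /* embedding level unknown */,
                             pango_language_get_default (),
                             attrs, n_chars + 1);
        result = attrs[char_index].is_cursor_position;

        g_free (attrs);
        return result;
    }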
- Backward: the backward mapping flag is needed as part of the
cluster mapping, but it may also be used to ensure that
extracted bidi text behaves the way it was intended.
The main problem is that PDF doesn't have an easy way to
convey bidi information. There is a ReversedText property
but it belongs to the Tagged PDF part of the spec, which is
far from supported. We will assume that glyphs in a PDF are
always laid out in visual order, that is, from left to
right. There is nothing preventing a library from generating
glyphs that have a negative advance width and so go in the
logical order for right-to-left text, but it's not common
practice and most probably not very well supported. So we'll
stay with LTR glyphs for now.
The problem then is how a viewer like Poppler should do the
reverse-bidi step to get to the logical text from the visual
text extracted from the glyphs. Poppler currently uses a
heuristic based on the bidi direction of the characters
extracted, and while it does a good job at it, let's just
say that it doesn't do it very correctly. What it does is
find maximal runs of RTL chars, reverse the text extracted
for each run, and sandwich it between a U+202B RIGHT-TO-LEFT
EMBEDDING and a U+202C POP DIRECTIONAL FORMATTING pair.
Obvious problems with this are:
o The hard thing about reverse-bidi is how to handle
"neutral" characters: those without a strong LTR or RTL
direction. Poppler doesn't try to group them with
adjacent RTL runs.
o With a run of all-RTL chars, there's absolutely no need
to put RLE/PDF around it.
o Instead of reversing the glyphs and then extracting and
appending text from them, it extracts the text first and
then reverses it. So if a glyph maps to two or more
characters, those come out backwards in the extracted
text, which is wrong.
We can do better by making some assumptions about the
generated PDF, and working a bit harder. The assumption to
make about the PDF is that each text show operation contains
a run of text with exactly one direction. That is, assume
that the text layout engine broke text into runs based on
direction among other things. This holds true for virtually
all text layout engines I know.
Another way to improve the situation is to break text runs
into hopefully less ambiguous ones in Pango, to make the job
of the extractor easier. This can be done, for example, when
both LTR and RTL characters are present in a text run.
Then the text extractor does this (the neutral-resolution
step is sketched after this list):
- For every line, extract text from all segments, and mark
each segment as LTR, RTL, or neutral based on the type of
chars found in it.
- Group adjacent runs of the same type together.
- Resolve neutral groups to the neighboring groups if both
are of the same type, otherwise resolve to the "base
direction" of the line (to be devised).
- Do some special-casing for digits, etc.
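A simplified sketch of the neutral-resolution step, with
hypothetical types; a real implementation would classify
characters using the Unicode Character Database (e.g. through
FriBidi):

    typedef enum { DIR_LTR, DIR_RTL, DIR_NEUTRAL } run_dir_t;

    /* Resolve each group of neutral runs to its neighbors'
     * direction if they agree, else to the base direction. */
    static void
    resolve_neutrals (run_dir_t *dirs, int n, run_dir_t base_dir)
    {
        int i;
        for (i = 0; i < n; i++) {
            int j;
            run_dir_t before, after, resolved;

            if (dirs[i] != DIR_NEUTRAL)
                continue;
            /* find the end of this neutral group */
            j = i;
            while (j < n && dirs[j] == DIR_NEUTRAL)
                j++;
            before = i > 0 ? dirs[i - 1] : base_dir;
            after  = j < n ? dirs[j]     : base_dir;
            resolved = (before == after) ? before : base_dir;
            while (i < j)
                dirs[i++] = resolved;
        }
    }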
This is really about writing a sophisticated reverse-bidi
algorithm, something that I've wanted to do for such a long
time, and I may as well tackle it this time. Lacking that, I
have observed before that a good approximation to a
reverse-bidi algorithm is the bidi algorithm itself. So one
may feel lucky and use pango_log2vis_get_embedding_levels()
directly (or the FriBidi equivalent). It's mostly about
getting the details right, and I can't say more before
getting down to code. One complication is that the text
stream may itself contain bidi override marks, as we now
can, and want to, embed them in the text stream.
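As a sketch of that feeling-lucky approach:
pango_log2vis_get_embedding_levels() is real Pango API and returns
one embedding level per character; everything else here, including
running it on visually-ordered text as an approximation, is
illustrative:

    #include <pango/pango.h>

    static void
    approximate_reverse_bidi (const char *visual_text, int len)
    {
        PangoDirection base_dir = PANGO_DIRECTION_NEUTRAL;
        guint8 *levels;

        levels = pango_log2vis_get_embedding_levels (visual_text,
                                                     len,
                                                     &base_dir);

        /* ... walk `levels': maximal runs with odd levels are
         * right-to-left and would need their extracted
         * characters reversed to recover logical order ... */

        g_free (levels);
    }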
- Other poppler issues I faced:
o When a glyph maps to more than one character and part of
the glyph is selected, it renders the glyph multiple times.
This should be obvious to fix.
o When a glyph maps to zero chars, it doesn't render the
glyph in the selection and copies a space character instead.
More importantly, we should see how Adobe Reader handles it.
o It has serious problems copying combining marks positioned
using the Td operator.
o It does a hefty job at column detection, which is good, but
has serious problems. What I understand from reading the
code is that it sorts all the glyphs in visual order (top
to bottom, left to right) and processes them in that order.
It should instead work at the text-operation level, at the
least. My guess is that simply processing text operations
in their running order in the PDF file produces much better
results, but then again there was probably good reason to
write the current code in the first place.
- Thinking about implementation, ToUnicode tables are
independent of font formats, so one piece of code can
generate them nicely, which is mostly the case already. But
the codepoint-to-glyph mapping lives in different places for
different font types. So the code to handle it is spread
over both cairo-*-subset.c files and cairo-*-surface.c files.
But given all the PDF expertise we have thanks to Adrian's
awesome work, I wouldn't worry much about implementation :-).
- Pango also needs a new method in PangoRenderer called
draw_item, which is the Pango equivalent of show_text_glyphs.
That's easy. A harder issue is making Pango render the runs
in a line in logical order, as opposed to the current visual
order. That's a bit harder because of the way the
PangoLayoutLine struct has been made public and can't be
changed. With runs rendered in logical order, viewers that
simply follow PDF order work best, and those sorting
runs/glyphs based on their position will not work any worse
than they currently do.
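For concreteness, here is one hypothetical shape the new vfunc
could take; this is not real Pango API, just an assumption about
what it might look like. A PangoItem describes the input text
range and its analysis, and a PangoGlyphString the output glyphs,
which together is exactly the text/glyphs/clusters information
that cairo_show_text_glyphs() wants:

    #include <pango/pango.h>

    /* Hypothetical signature for the proposed vfunc. */
    typedef void (*PangoRendererDrawItemFunc)
                     (PangoRenderer    *renderer,
                      const char       *text,
                      PangoItem        *item,
                      PangoGlyphString *glyphs,
                      int               x,
                      int               y);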
=================================================================
Other Issues:
Some other random thoughts about PDF text output that are not
necessarily related to the text extraction problem, but that
passed through my mind during these experiments:
- A problem with using composite fonts is that by the time you
find out that you need a composite font (that is, more than
255 glyphs of the font should be subsetted), it's too late
to switch, since you have already output PDF code for the
previous glyphs as single-byte codepoints. So one ends up
using composite fonts unconditionally (the exception is when
the original font has fewer than 256 glyphs, in which case
there's no point in using composite fonts at all). This
slightly wastes space, as each codepoint will be encoded as
four hex bytes instead of two. This assumes the simple UCS-2
Identity CMap codepoint encoding that maps two bytes to a
glyph value of 0..65535. However, one can use a UTF-8-like
scheme such that the first 128 glyphs are accessed using one
byte and so on (sketched below). This way the PDF generator
can use a simple font for subsets with fewer than 128
glyphs. However, Adrian tells me that Poppler only supports
the UCS-2 Identity CMap mapping for CID font codepoint
encoding, so this may not be feasible. It does support
arbitrary CMap mappings for ToUnicode tables though, so the
code certainly is there... The PDF spec also says that:
Acrobat Versions 4.0 and 5.0 (PDF Versions 1.3 and 1.4,
respectively) must use “ToUnicode” mapping files that are
restricted to UCS-2 (Big Endian) encoding, which is
equivalent to UTF-16BE encoding without Surrogates.
(Note how the "fi" in "files" is encoded as U+FB01 LATIN
SMALL LIGATURE FI because I copy/pasted from the spec.)
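Here is a sketch of what such a UTF-8-like scheme could look like.
This is an idea for an encoding, not an existing PDF CMap; a real
implementation would also need to emit the matching prefix-free
codespace ranges (e.g. <00> <7F> for one byte, <8000> <FF7F> for
two):

    /* Sketch: glyph ids below 128 take one byte, ids up to
     * 16383 take two, so a subset of fewer than 128 glyphs
     * costs two hex digits per glyph in the content stream,
     * like a simple font. */
    static int
    encode_glyph_code (unsigned int glyph_id, unsigned char buf[2])
    {
        if (glyph_id < 0x80) {
            buf[0] = (unsigned char) glyph_id;   /* 0xxxxxxx */
            return 1;
        } else {
            /* lead byte has the high bit set, so one- and
             * two-byte codes never collide */
            buf[0] = (unsigned char) (0x80 | (glyph_id >> 7));
            buf[1] = (unsigned char) (glyph_id & 0x7f);
            return 2;
        }
    }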
- Shall we use standard encodings if all the used glyphs in a
subset are in a well-supported standard encoding? It may be
worth the slight optimization. It may also make the
generated PS/PDF more readable in the case of simple ASCII
text.
- It also occurred to me that in PDF almost all objects can
come after they are referenced. Does this mean we can write
out pages as we go and avoid the writing to a temp file that
we currently do?
- Tagged PDF allows for a lot more, but it's very hard to, for
example, mark all paragraphs and pages and such. There's
also a ReversedText tag in there. In most cases though, it
seems like we can do the same without it as far as text
extraction is concerned. Some cairo API may be added to
allow Tagged PDF marking from a higher level. Something
like:

    cairo_pdf_marked_content_sequence_t
    cairo_pdf_surface_begin/end_marked_content()
- Other possibly useful PDF APIs would be for embedding
arbitrary objects, for marking ActualText around arbitrary
drawing operations, and for getting the object number of any
embedded fonts, patterns, etc. We are still waiting for
someone more familiar with PDF and higher-level use-cases to
tell us what exactly they need from cairo...
=================================================================
Anyway, wow, 400 lines. Thanks for reading this far. I'm going
to do a presentation on PDF text extraction at the Linux
Foundation OpenPrinting Summit next week in Montréal, mostly based on
this mail:
http://www.linux-foundation.org/en/OpenPrinting/SummitMontreal
I'll try to hack something up this week, but most probably I'll
just get to do the Pango side and wait for some PDF ninja to do the
cairo side. You know who you are!
Cheers,
--
behdad
http://behdad.org/
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759