[cairo] Improving PDF output
otaylor at redhat.com
Tue Jan 9 11:47:51 PST 2007
On Tue, 2007-01-09 at 13:38 -0500, Behdad Esfahbod wrote:
> On Tue, 2007-01-09 at 10:38 -0500, Owen Taylor wrote:
> > It's worth pointing out here that there is a second way of associating
> > text with a PDF document ... it can also be done by providing ActualText
> > entries in the "structure tree" for the document. This is really the only
> > way that selection of text from certain complex-text languages is going
> > to work.
> > What I don't know is what (if any) PDF viewers support encoding text
> > this way, but adding support for that to the Pango/cairo/poppler/evince
> > stack would be a fun project for someone. Why should cutting and pasting
> > of text from PDF documents be restricted to Western and CJK languages?
> > It would require cairo and (low-level) Pango API changes, since the
> > information about the original text is gone by the time that the PDF
> > layer gets its hands on the glyphs.
> Yes, this is exactly what we were talking about last night on IRC. My
> current plan is:
> - Add something like cairo_show_graphemes() to cairo, that takes a
> text string, a glyph string, and the mapping between them that forms the
> - Pango apparently has all this information readily available and can
> make such a call very easily.
As long as people aren't using pango_cairo_show_glyph_string() directly...
normally, certainly yes.
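To make the shape of that input concrete, here is a rough sketch of the three pieces mentioned above: a text string, a glyph string, and the mapping between them. It is Python purely for illustration and is not cairo or Pango API; `Cluster`, `iter_clusters`, and the glyph IDs are all made up. It also assumes the text and the glyphs run in the same direction, so a right-to-left run would need a direction flag or a reversed walk.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Cluster:
    """One text/glyph correspondence (hypothetical, not a cairo type)."""
    num_chars: int   # how many characters of the text this cluster covers
    num_glyphs: int  # how many glyphs render those characters


def iter_clusters(text: str, glyphs: List[int],
                  clusters: List[Cluster]) -> Iterator[Tuple[str, List[int]]]:
    """Yield (text slice, glyph slice) for each cluster, in order."""
    ci = gi = 0
    for c in clusters:
        yield text[ci:ci + c.num_chars], glyphs[gi:gi + c.num_glyphs]
        ci += c.num_chars
        gi += c.num_glyphs
    # The mapping is only usable if it accounts for every character and glyph.
    if ci != len(text) or gi != len(glyphs):
        raise ValueError("clusters do not cover the whole text and glyph string")


# "office" set with an 'ffi' ligature: 6 characters, 4 glyphs (made-up IDs).
pairs = list(iter_clusters("office", [50, 90, 38, 40],
                           [Cluster(1, 1), Cluster(3, 1),
                            Cluster(1, 1), Cluster(1, 1)]))
```

With this in hand a backend can see that the single ligature glyph stands for the three characters "ffi", which is exactly the information that is lost when only glyphs are passed down.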
> - Make Pango go over the glyphstring for right-to-left runs from end
> to start (that is, in logical order), and for right-to-left lines from
> end run to start run, or in the logical order of the runs.
I think you always want to emit glyphs in visual order and let the
backend figure out how to best encode that into a document for both
compactness (not positioning every glyph) and correctly representing
the text. Having glyph order not correspond to the X advance of the
font is going to be awkward.
> - The PDF backend will, for each grapheme, use reverse cmap lookup to
> get a text string associated with the glyphs. If this string is
> identical to the one provided, it will be used directly (like Alp's
> patch, or lack thereof, does), otherwise, it will start a new text
> operation for this grapheme and use ActualText around it.
Hmm, it might make sense to make that determination for the whole
string at once, to avoid an encoding for Hindi (say) where you are
constantly switching between the two representations, since there
are always going to be some graphemes/characters that can be mapped
by the cmap.
Though I suppose you need to break up the ActualText markings to the
grapheme (more properly, cluster ... a 'ff' ligature is two graphemes
but one cluster) level to allow for proper selection boundaries.
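A minimal sketch of the per-cluster decision discussed above, again in Python and under stated assumptions: `to_unicode` stands in for the reverse cmap lookup, glyphs are written as 2-byte codes (as with an Identity-H font), and the helper names are invented. The emitted lines are meant to sit inside a BT ... ET text object with a font already selected. The missing entry for glyph 90 is contrived to force the ActualText path.

```python
from typing import Dict, List, Tuple


def hex_glyphs(glyphs: List[int]) -> str:
    """Glyph IDs as a PDF hex string, assuming 2-byte codes (e.g. Identity-H)."""
    return "<" + "".join(f"{g:04X}" for g in glyphs) + ">"


def actual_text(text: str) -> str:
    """Text as a PDF hex text string: UTF-16BE with a leading byte order mark."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def encode_clusters(pairs: List[Tuple[str, List[int]]],
                    to_unicode: Dict[int, str]) -> List[str]:
    """Return content-stream lines, wrapping a cluster in ActualText only
    when the reverse glyph-to-text lookup fails to reproduce its text."""
    out: List[str] = []
    plain: List[int] = []  # glyphs whose text the lookup already recovers

    def flush() -> None:
        if plain:
            out.append(f"{hex_glyphs(plain)} Tj")
            plain.clear()

    for text, glyphs in pairs:
        recovered = "".join(to_unicode.get(g, "") for g in glyphs)
        if recovered == text:
            plain.extend(glyphs)
        else:
            flush()
            out.append(f"/Span << /ActualText {actual_text(text)} >> BDC")
            out.append(f"{hex_glyphs(glyphs)} Tj")
            out.append("EMC")
    flush()
    return out


# Hypothetical data: glyph 90 (an 'ffi' ligature) has no reverse mapping.
lines = encode_clusters(
    [("o", [50]), ("ffi", [90]), ("c", [38]), ("e", [40])],
    {50: "o", 38: "c", 40: "e"},
)
```

Deciding once for the whole string instead would amount to running the comparison over all clusters first and, if any one fails, wrapping every cluster, which keeps the cluster-level boundaries but avoids flipping between the two representations.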
> - Poppler/Evince just need to do the logical mapping between the glyph
> boundaries of the grapheme and the ActualText characters provided. That
> is, to break the width into the number of characters, etc. Some glib
> Unicode calls can help with which characters are cursor positions and
> which are not. Or rather, pango calls.
> I just wonder if the cairo API needs to know about right-to-left
> glyphstrings. Is there anything that can be encoded in the PDF?
Yes, there is a ReversedChars annotation that indicates that the
characters within the enclosed operator are in reverse of logical order.
The combination of that plus a ToUnicode map would probably allow
doing most Arabic without the use of ActualText. (Note the restriction
of ReversedChars to single words without embedded spaces, so you still
have to break up text to the word level.)
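As a rough sketch of what that could look like in a content stream, the Python below wraps each space-free word of a visual-order line in its own ReversedChars marked-content sequence. The function names are made up, and the ASCII strings are stand-ins: real Arabic output would use font-encoded glyph codes together with a ToUnicode map. Showing the spaces outside the marked content is just one possible choice, and the sketch says nothing about how a viewer recovers the order of the words themselves.

```python
from typing import List


def pdf_literal(s: str) -> str:
    """Escape a string as a PDF literal string."""
    escaped = s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def reversed_words(visual_line: str) -> List[str]:
    """Emit each space-free word of a visual-order line in its own
    ReversedChars sequence; spaces are shown outside the marked content."""
    out: List[str] = []
    for i, word in enumerate(visual_line.split(" ")):
        if i > 0:
            out.append("( ) Tj")
        if word:
            out += ["/ReversedChars BMC", f"{pdf_literal(word)} Tj", "EMC"]
    return out


# Visual-order stand-in for a right-to-left line whose logical text
# is "hello world".
ops = reversed_words("dlrow olleh")
```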