[cairo] Improving PDF output
Behdad Esfahbod
behdad at behdad.org
Tue Jan 9 10:38:39 PST 2007
On Tue, 2007-01-09 at 10:38 -0500, Owen Taylor wrote:
>
> It's worth pointing out here that there is a second way of associating
> text with a PDF document ... it can also be done by providing ActualText
> entries in the "structure tree" for the document. This is really the only
> way that selection of text from certain complex-text languages is going
> to work.
>
> What I don't know is what (if any) PDF viewers support encoding text
> this way, but adding support for that to the Pango/cairo/poppler/evince
> stack would be a fun project for someone. Why should cutting and pasting
> of text from PDF documents be restricted to Western and CJK languages?
>
> It would require cairo and (low-level) Pango API changes, since the
> information about the original text is gone by the time that the PDF
> layer gets its hands on the glyphs.
Yes, this is exactly what we were talking about last night on IRC. My
current plan is:
- Add something like cairo_show_graphemes() to cairo, that takes a
text string, a glyph string, and the mapping between them that forms the
graphemes.
- Pango apparently has all this information readily available and can
make such a call very easily.
- Make Pango walk the glyph string of each right-to-left run from end
to start (that is, in logical order), and walk the runs of each
right-to-left line from last run to first, so that the runs are also
visited in logical order.
- The PDF backend will, for each grapheme, do a reverse cmap lookup to
get the text string associated with its glyphs. If that string is
identical to the one provided, the glyphs will be emitted directly (as
Alp's patch, or lack thereof, does); otherwise, the backend will start a
new text operation for this grapheme and wrap it in ActualText.
- Poppler/Evince just need to do the logical mapping between the glyph
boundaries of the grapheme and the ActualText characters provided. That
is, divide the grapheme's width among its characters, etc. Some glib
Unicode calls (or rather, Pango calls) can help determine which
characters are cursor positions and which are not.
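To make the PDF backend step concrete, here is a rough model of the
per-grapheme decision, written in Python for brevity. None of this is
cairo API: Grapheme, reverse_cmap_text() and plan_pdf_text_ops() are
hypothetical names, and the reverse cmap is assumed to be a plain
glyph-index-to-character table. It is only a sketch of the logic, not
how the backend would actually be written.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Grapheme:
    """Hypothetical unit passed down by the caller: text plus its glyphs."""
    text: str          # original characters, in logical order
    glyphs: List[int]  # glyph indices that render this text


def reverse_cmap_text(glyphs: List[int],
                      reverse_cmap: Dict[int, str]) -> Optional[str]:
    """Recover text from glyphs, or None if any glyph has no mapping."""
    chars = []
    for glyph in glyphs:
        ch = reverse_cmap.get(glyph)
        if ch is None:
            return None
        chars.append(ch)
    return "".join(chars)


Op = Tuple[str, List[int], Optional[str]]


def plan_pdf_text_ops(graphemes: List[Grapheme],
                      reverse_cmap: Dict[int, str]) -> List[Op]:
    """Decide, per grapheme, between plain output and an ActualText wrapper.

    Returns a list of (kind, glyphs, actual_text) tuples, where kind is
    "plain" or "actualtext". Adjacent plain graphemes share one operation.
    """
    ops: List[Op] = []
    for gr in graphemes:
        if reverse_cmap_text(gr.glyphs, reverse_cmap) == gr.text:
            # Reverse lookup already yields the right text: emit directly.
            if ops and ops[-1][0] == "plain":
                ops[-1][1].extend(gr.glyphs)
            else:
                ops.append(("plain", list(gr.glyphs), None))
        else:
            # Lookup fails or differs: new text operation with ActualText.
            ops.append(("actualtext", list(gr.glyphs), gr.text))
    return ops


# Assumed toy font: glyph 20 is an "fi" ligature with no cmap entry.
toy_reverse_cmap = {10: "f", 11: "i", 12: "a"}
toy_graphemes = [
    Grapheme("a", [12]),
    Grapheme("fi", [20]),
    Grapheme("a", [12]),
    Grapheme("f", [10]),
]
toy_ops = plan_pdf_text_ops(toy_graphemes, toy_reverse_cmap)
```

With the toy font above, only the ligature grapheme ends up wrapped in
ActualText; the simple glyphs on either side go out as ordinary text
operations, which is the case the existing approach already handles.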
I just wonder if the cairo API needs to know about right-to-left
glyphstrings. Is there anything that can be encoded in the PDF?
> - Owen
>
--
behdad
http://behdad.org/
"Those who would give up Essential Liberty to purchase a little
Temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin, 1759