[cairo] Improving PDF output

Alp Toker alp at atoker.com
Sun Jan 7 05:53:57 PST 2007


Right now, PDFs generated by Cairo look OK. However, attempts to select 
or copy text in a viewer will fail if the TrueType font subsetter has 
kicked in.

I did some research to learn about the issues.

The subsetter emits a bijective format 6 'cmap' subtable
(http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6cmap.html).
The trouble is that this discards the Unicode mapping information. I
modified the TrueType subsetting code to map subset glyph IDs back to
Unicode code points, providing a more sensible cmap section, and then
modified the PDF surface to use this reverse mapping to include proper
Unicode strings in the generated PDF rather than arbitrary glyph IDs.
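
As a rough illustration of the reverse mapping described above, the
following C sketch keeps the originating code point next to each subset
glyph and uses it to turn a run of subset glyph IDs into Unicode. The
names subset_glyph_t, subset_glyph_to_unicode and glyphs_to_unicode are
made up for this example and are not taken from the cairo source; the
actual change may well be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, one per glyph kept in the subset.
 * The array index is the subset glyph ID. */
typedef struct {
    uint16_t parent_glyph; /* glyph index in the original font */
    uint32_t unicode;      /* code point that selected it, 0 if unknown */
} subset_glyph_t;

/* Return the code point for a subset glyph ID, or 0 if the ID is out
 * of range or no code point was recorded for it. */
uint32_t
subset_glyph_to_unicode (const subset_glyph_t *glyphs,
                         size_t                n_glyphs,
                         size_t                subset_id)
{
    assert (glyphs != NULL);

    if (subset_id >= n_glyphs)
        return 0;

    return glyphs[subset_id].unicode;
}

/* Convert a run of subset glyph IDs to code points. Returns the number
 * of code points written, or -1 if any glyph has no known mapping, in
 * which case the caller would fall back to emitting raw glyph IDs. */
int
glyphs_to_unicode (const subset_glyph_t *glyphs,
                   size_t                n_glyphs,
                   const uint16_t       *ids,
                   size_t                n_ids,
                   uint32_t             *out)
{
    size_t i;

    assert (ids != NULL && out != NULL);

    for (i = 0; i < n_ids; i++) {
        uint32_t ucs4 = subset_glyph_to_unicode (glyphs, n_glyphs, ids[i]);

        if (ucs4 == 0)
            return -1;
        out[i] = ucs4;
    }

    return (int) n_ids;
}
```

The all-or-nothing return value is a design choice made for this sketch:
a run is only emitted as a Unicode string when every glyph in it can be
mapped back.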

This makes it possible to use text extraction utilities on the PDF, as 
well as making the generated PDF source more readable, and possibly 
easier to optimise.

However, this was not enough, and text selection was still awkward in 
Evince. Further investigation showed that the values provided by 
cairo_truetype_subset_t such as x/y_min/max, ascent/descent and widths 
were incorrect. In fact, simply halving each of these values resulted in 
correct behaviour when selecting text. There were still some glitches, 
but it was now possible to select and copy text out of a Cairo-generated 
PDF using arbitrary TrueType fonts in Evince.

Finally, I noticed the file size of my generated document was much 
larger than expected. Review of the generated PDF showed that a very 
verbose syntax was being used for text output:

   BT
   /CairoFont-0-0 1 Tf
   10.72 0 -0 -10.72 120.032 972.133333 Tm <56> Tj
   10.72 0 -0 -10.72 127.39203 972.133333 Tm <65> Tj
   10.72 0 -0 -10.72 133.791978 972.133333 Tm <72> Tj
   10.72 0 -0 -10.72 138.271973 972.133333 Tm <73> Tj
   10.72 0 -0 -10.72 143.871994 972.133333 Tm <69> Tj
   10.72 0 -0 -10.72 146.752029 972.133333 Tm <6f> Tj
   10.72 0 -0 -10.72 153.312026 972.133333 Tm <6e> Tj
   10.72 0 -0 -10.72 160.032072 972.133333 Tm <20> Tj
   ET

Each glyph is positioned with its own complete text matrix (Tm). This
was easily reduced to a single Tm operator followed by simpler
translation commands.
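
To make the reduction concrete, here is a minimal C sketch that turns
absolute glyph origins into relative offsets. It assumes the simpler
translation command is the PDF Td operator, whose operands are given in
unscaled text space, and that the text matrix is unrotated, of the form
[sx 0 0 sy x0 y0]. The type point_t and the function
origins_to_td_offsets are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double x, y;
} point_t;

/* Given n absolute glyph origins, write n - 1 Td operands into offsets.
 * Each offset is relative to the previous glyph origin. Since Td
 * operands are scaled by the text matrix, the page-space deltas are
 * divided by the matrix scale factors sx and sy. */
void
origins_to_td_offsets (const point_t *origins,
                       size_t         n,
                       double         sx,
                       double         sy,
                       point_t       *offsets)
{
    size_t i;

    assert (sx != 0.0 && sy != 0.0);
    assert (n < 2 || (origins != NULL && offsets != NULL));

    for (i = 1; i < n; i++) {
        offsets[i - 1].x = (origins[i].x - origins[i - 1].x) / sx;
        offsets[i - 1].y = (origins[i].y - origins[i - 1].y) / sy;
    }
}
```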

This in itself was not a huge win, but it allowed for further 
simplification by analysis of the translations to see if they match the 
expected translation from metadata embedded in the font. If this happens 
to be the case, we can omit the translations entirely and allow the PDF 
viewer to deal with them. So the resulting PDF output looks something 
like this (where "Version" was the original string):

   BT
   /CairoFont-0-0 1 Tf
   10.72 0 -0 -10.72 120.032 972.133333 Tm
   (Version) Tj
   ET
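
The advance check described above can be sketched in C roughly as
follows. This is only an illustration under stated assumptions: the
text is horizontal and unrotated, advances are expressed in em units so
that multiplying by the font size gives the step in user space, and
positioned_glyph_t and run_matches_font_advances are hypothetical names
rather than anything in the cairo source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input: one positioned glyph as handed to the surface. */
typedef struct {
    double x, y;    /* glyph origin in user space */
    double advance; /* horizontal advance from the font, in em units */
} positioned_glyph_t;

/* Absolute difference, written out to avoid depending on libm. */
static double
abs_diff (double a, double b)
{
    double d = a - b;

    return d < 0.0 ? -d : d;
}

/* Return 1 if every glyph after the first sits exactly where the
 * previous glyph's advance would place it (within tolerance), so the
 * run can be written as one string after a single Tm. Return 0 if any
 * glyph deviates and explicit positioning has to be kept. */
int
run_matches_font_advances (const positioned_glyph_t *glyphs,
                           size_t                    n,
                           double                    font_size,
                           double                    tolerance)
{
    size_t i;

    assert (glyphs != NULL || n == 0);

    for (i = 1; i < n; i++) {
        double expected_x = glyphs[i - 1].x +
                            glyphs[i - 1].advance * font_size;

        if (abs_diff (glyphs[i].x, expected_x) > tolerance ||
            abs_diff (glyphs[i].y, glyphs[i - 1].y) > tolerance)
            return 0;
    }

    return 1;
}
```

A caller would run this over each glyph run and only fall back to
per-glyph positioning when it returns 0, for example where the
application has applied its own kerning or justification.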

Since my document is a 500-page text, the benefit is substantial:
output size drops from 80M to about 20M. More importantly, the
generated PDF has improved archival value, as the content can be
indexed by Beagle, read by screen readers, etc.

I'd be interested to hear any thoughts on these approaches to improving 
PDF output in Cairo.

PS. Reading through the code, I spotted a few typos for which a fix is 
attached.
-------------- next part --------------
From 7089cbde389719b766355f446f05c91221267c01 Mon Sep 17 00:00:00 2001
From: Alp Toker <alp at atoker.com>
Date: Sun, 7 Jan 2007 02:03:30 +0000
Subject: [PATCH] Fix various code/comment typos

---
 pixman/src/pixregion.c    |    2 +-
 src/cairo-matrix.c        |    2 +-
 src/cairo-pdf-surface.c   |   20 ++++++++++----------
 src/cairo-win32-surface.c |    2 +-
 src/cairo-xcb-surface.c   |    4 ++--
 src/cairo-xlib-surface.c  |   10 +++++-----
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/pixman/src/pixregion.c b/pixman/src/pixregion.c
index 5112157..0404dff 100644
--- a/pixman/src/pixregion.c
+++ b/pixman/src/pixregion.c
@@ -1421,7 +1421,7 @@ QuickSortRects(
  *      Step 2. Split the rectangles into the minimum number of proper y-x
  *		banded regions.  This may require horizontally merging
  *		rectangles, and vertically coalescing bands.  With any luck,
- *		this step in an identity tranformation (ala the Box widget),
+ *		this step in an identity transformation (ala the Box widget),
  *		or a coalescing into 1 box (ala Menus).
  *
  *	Step 3. Merge the separate regions down to a single region by calling
diff --git a/src/cairo-matrix.c b/src/cairo-matrix.c
index d4f4bf7..b689f2a 100644
--- a/src/cairo-matrix.c
+++ b/src/cairo-matrix.c
@@ -102,7 +102,7 @@ slim_hidden_def(cairo_matrix_init);
  * @x0: location to store x0 (X-translation component) of matrix, or %NULL
  * @y0: location to store y0 (Y-translation component) of matrix, or %NULL
  *
- * Gets the matrix values for the affine tranformation that @matrix represents.
+ * Gets the matrix values for the affine transformation that @matrix represents.
  * See cairo_matrix_init().
  *
  *
diff --git a/src/cairo-pdf-surface.c b/src/cairo-pdf-surface.c
index a59cd99..2778d34 100644
--- a/src/cairo-pdf-surface.c
+++ b/src/cairo-pdf-surface.c
@@ -626,7 +626,7 @@ compress_dup (const void *data, unsigned long data_size,
 }
 
 /* Emit alpha channel from the image into the given data, providing
- * and id that can be used to reference the resulting SMask object.
+ * an id that can be used to reference the resulting SMask object.
  *
  * In the case that the alpha channel happens to be all opaque, then
  * no SMask object will be emitted and *id_ret will be set to 0.
@@ -1006,7 +1006,7 @@ emit_linear_colorgradient (cairo_pdf_surface_t		*surface,
 }
 
 static cairo_pdf_resource_t
-emit_stiched_colorgradient (cairo_pdf_surface_t   *surface,
+emit_stitched_colorgradient (cairo_pdf_surface_t   *surface,
 			    unsigned int 	   n_stops,
 			    cairo_pdf_color_stop_t stops[])
 {
@@ -1020,7 +1020,7 @@ emit_stiched_colorgradient (cairo_pdf_surface_t   *surface,
 						       &stops[i+1]);
     }
 
-    /* ... and stich them together */
+    /* ... and stitch them together */
     function = _cairo_pdf_surface_new_object (surface);
     _cairo_output_stream_printf (surface->output,
 				 "%d 0 obj\r\n"
@@ -1065,7 +1065,7 @@ emit_stiched_colorgradient (cairo_pdf_surface_t   *surface,
     return function;
 }
 
-#define COLOR_STOP_EPSILLON 1e-6
+#define COLOR_STOP_EPSILON 1e-6
 
 static cairo_pdf_resource_t
 emit_pattern_stops (cairo_pdf_surface_t *surface, cairo_gradient_pattern_t *pattern)
@@ -1095,13 +1095,13 @@ emit_pattern_stops (cairo_pdf_surface_t *surface, cairo_gradient_pattern_t *patt
 
     /* make sure first offset is 0.0 and last offset is 1.0. (Otherwise Acrobat
      * Reader chokes.) */
-    if (stops[0].offset > COLOR_STOP_EPSILLON) {
+    if (stops[0].offset > COLOR_STOP_EPSILON) {
 	    memcpy (allstops, stops, sizeof (cairo_pdf_color_stop_t));
 	    stops = allstops;
 	    stops[0].offset = 0.0;
 	    n_stops++;
     }
-    if (stops[n_stops-1].offset < 1.0 - COLOR_STOP_EPSILLON) {
+    if (stops[n_stops-1].offset < 1.0 - COLOR_STOP_EPSILON) {
 	    memcpy (&stops[n_stops],
 		    &stops[n_stops - 1],
 		    sizeof (cairo_pdf_color_stop_t));
@@ -1110,12 +1110,12 @@ emit_pattern_stops (cairo_pdf_surface_t *surface, cairo_gradient_pattern_t *patt
     }
 
     if (n_stops == 2) {
-	/* no need for stiched function */
+	/* no need for stitched function */
 	function = emit_linear_colorgradient (surface, &stops[0], &stops[1]);
     } else {
-	/* multiple stops: stich. XXX possible optimization: regulary spaced
-	 * stops do not require stiching. XXX */
-	function = emit_stiched_colorgradient (surface,
+	/* multiple stops: stitch. XXX possible optimization: regulary spaced
+	 * stops do not require stitching. XXX */
+	function = emit_stitched_colorgradient (surface,
 					       n_stops,
 					       stops);
     }
diff --git a/src/cairo-win32-surface.c b/src/cairo-win32-surface.c
index 120849d..2c2a5fd 100644
--- a/src/cairo-win32-surface.c
+++ b/src/cairo-win32-surface.c
@@ -1828,7 +1828,7 @@ cairo_win32_surface_get_dc (cairo_surface_t *surface)
 }
 
 /**
- * cario_win32_surface_get_image
+ * cairo_win32_surface_get_image
  * @surface: a #cairo_surface_t
  *
  * Returns a #cairo_surface_t image surface that refers to the same bits
diff --git a/src/cairo-xcb-surface.c b/src/cairo-xcb-surface.c
index 8b8ba1d..2fdf8a1 100644
--- a/src/cairo-xcb-surface.c
+++ b/src/cairo-xcb-surface.c
@@ -926,7 +926,7 @@ _operator_needs_alpha_composite (cairo_operator_t op,
 /* There is a bug in most older X servers with compositing using a
  * untransformed repeating source pattern when the source is in off-screen
  * video memory, and another with repeated transformed images using a
- * general tranform matrix. When these bugs could be triggered, we need a
+ * general transform matrix. When these bugs could be triggered, we need a
  * fallback: in the common case where we have no transformation and the
  * source and destination have the same format/visual, we can do the
  * operation using the core protocol for the first bug, otherwise, we need
@@ -2020,7 +2020,7 @@ _cairo_xcb_surface_add_glyph (xcb_connection_t *dpy,
      *
      *  This is a postscript-y model, where each glyph has its own
      *  coordinate space, so it's what we expose in terms of metrics. It's
-     *  apparantly what everyone's expecting. Everyone except the Render
+     *  apparently what everyone's expecting. Everyone except the Render
      *  extension. Render wants to see a glyph tile starting at (0,0), with
      *  an origin offset inside, like this:
      *
diff --git a/src/cairo-xlib-surface.c b/src/cairo-xlib-surface.c
index fbfae75..6a0d3e4 100644
--- a/src/cairo-xlib-surface.c
+++ b/src/cairo-xlib-surface.c
@@ -1106,7 +1106,7 @@ _operator_needs_alpha_composite (cairo_operator_t op,
 /* There is a bug in most older X servers with compositing using a
  * untransformed repeating source pattern when the source is in off-screen
  * video memory, and another with repeated transformed images using a
- * general tranform matrix. When these bugs could be triggered, we need a
+ * general transform matrix. When these bugs could be triggered, we need a
  * fallback: in the common case where we have no transformation and the
  * source and destination have the same format/visual, we can do the
  * operation using the core protocol for the first bug, otherwise, we need
@@ -1166,7 +1166,7 @@ _categorize_composite_operation (cairo_xlib_surface_t *dst,
 
 		/* If these are on the same screen but otherwise incompatible,
 		 * make a copy as core drawing can't cross depths and doesn't
-		 * work rightacross visuals of the same depth
+		 * work right across visuals of the same depth
 		 */
 		if (_cairo_xlib_surface_same_screen (dst, src) &&
 		    !_surfaces_compatible (dst, src))
@@ -2390,7 +2390,7 @@ _cairo_xlib_surface_add_glyph (Display *dpy,
      *
      *  This is a postscript-y model, where each glyph has its own
      *  coordinate space, so it's what we expose in terms of metrics. It's
-     *  apparantly what everyone's expecting. Everyone except the Render
+     *  apparently what everyone's expecting. Everyone except the Render
      *  extension. Render wants to see a glyph tile starting at (0,0), with
      *  an origin offset inside, like this:
      *
@@ -2694,9 +2694,9 @@ _cairo_xlib_surface_emit_glyphs (cairo_xlib_surface_t *dst,
 	 * the first zero-size glyph.  However, we don't skip all size-zero
 	 * glyphs, since that will force a new element at every space.  We
 	 * skip initial size-zero glyphs and hope that it's enough.  Since
-	 * Xft never exposed that bug, this assumptation should be correct.
+	 * Xft never exposed that bug, this assumption should be correct.
 	 *
-	 * We also skip any glyph that hav troublesome coordinates.  We want
+	 * We also skip any glyphs that have troublesome coordinates.  We want
 	 * to make sure that (glyph2.x - (glyph1.x + glyph1.width)) fits in
 	 * a signed 16bit integer, otherwise it will overflow in the render
 	 * protocol.
-- 
1.5.0.rc0.g244a7


