[cairo] cairo-gl glyph rendering performance

Tue Apr 19 15:30:58 PDT 2011

Hi all!

I have been investigating the cairo-gl glyph implementation to see if we
can improve the glyph rendering performance.

I have found that one source of performance loss is the overzealous selection
of the "via mask" path when rendering glyphs. When using the "via mask" path,
glyphs are first rendered to a temporary surface which is then used as a mask
to draw the glyphs on the final destination.

In the current code, one of the reasons to use the mask path is because the
glyphs overlap. Is this valid? I haven't figured out the technical reason for
this, so I may be missing something, but the following seems strange: let's
say we have two glyphs that actually overlap and we want to draw them. In the
present situation, if we draw the glyphs using a single glyph group they will
be drawn via a mask. If we draw them separately (in two groups), each will be
drawn using the normal path.  Why is one more correct than the other?  Do
glyphs belonging to the same group have some special connection?

In any case, the overlap detection test as implemented in
_cairo_scaled_font_glyph_device_extents() is not suited for our needs for two
reasons:

1. The overlap detection algorithm checks the extents of each glyph against the
   current total extent of previously processed glyphs. This works fine as long
   as the glyph group is limited to a single line and drawn sequentially.
   However, for multi-line glyph groups or for groups with out-of-order glyphs,
   a false "overlap" is always detected. The ASCII figure below shows
   why:

   ---------    ---------    ---------
   |A B C D| => |A B C D| => |A B C D| False Overlap when checking X!
   ---------    |E      |    |E X    |
                ---------    ---------

2. Due to font kerning, glyph extents are often found to overlap, although
   the glyphs themselves do not actually overlap.
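
A minimal sketch of issue (1), to make the figure concrete. This is a toy
model in Python, not the cairo code: the Box class, the glyph() helper and the
box sizes are all made up for illustration. It compares the accumulated-extents
test described above with a plain pairwise test (an assumed alternative, not
something cairo implements) on the layout from the figure.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned extents of one glyph (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def intersects(self, other):
        # Boxes that merely touch along an edge do not count as overlapping.
        return (self.x1 < other.x2 and other.x1 < self.x2 and
                self.y1 < other.y2 and other.y1 < self.y2)

    def union(self, other):
        return Box(min(self.x1, other.x1), min(self.y1, other.y1),
                   max(self.x2, other.x2), max(self.y2, other.y2))


def overlap_accumulated(boxes):
    """Test each box against the union of all previously seen boxes."""
    total = None
    for box in boxes:
        if total is not None and box.intersects(total):
            return True
        total = box if total is None else total.union(box)
    return False


def overlap_pairwise(boxes):
    """Test every pair of boxes directly (quadratic, for comparison only)."""
    return any(a.intersects(b) for a, b in combinations(boxes, 2))


def glyph(col, row, size=8, gap=2):
    """Place a square glyph box on a grid, with a gap between neighbours."""
    x = col * (size + gap)
    y = row * (size + gap)
    return Box(x, y, x + size, y + size)


single_line = [glyph(c, 0) for c in range(4)]           # A B C D
two_lines = single_line + [glyph(0, 1), glyph(1, 1)]    # ... then E, X

if __name__ == "__main__":
    print("single line, accumulated:", overlap_accumulated(single_line))
    print("two lines, accumulated:  ", overlap_accumulated(two_lines))
    print("two lines, pairwise:     ", overlap_pairwise(two_lines))
```

In this model no two boxes intersect, yet the accumulated test reports an
overlap as soon as X is checked, because by then the total extents cover both
lines. The pairwise test avoids that, at quadratic cost, but it would not help
with issue (2), since kerned glyph extents really do intersect.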

In real-life applications that use a high-level layout library (e.g. pango),
issue (1) doesn't seem to be very frequent, as the glyph groups that are passed
to cairo are usually limited to single lines with glyphs rendered sequentially.
On the other hand, it is quite common to have at least one overlap per glyph
group because of kerning, and therefore the "via mask" path is selected much
more often than it should be.

As an experiment, I commented out the "use_mask |= overlap" line in
_cairo_gl_surface_show_glyphs() and measured the performance difference, to get
a feel for the improvements we can get:

a. For r600g Mesa git (GLX_MESA_multithread_makecurrent):
           firefox-talos-gfx gnome-terminal-vim poppler firefox-planet-gnome
overlap          54.073             15.656       9.421         83.339
no-overlap       17.489             13.480       3.051         74.192

b. For i965 Mesa git (GLX_MESA_multithread_makecurrent):
           firefox-talos-gfx gnome-terminal-vim poppler firefox-planet-gnome
overlap          37.167              7.556       7.045         44.158
no-overlap       36.439              7.064       3.253         42.432

c. For r600g-gles2 Mesa git:
           firefox-talos-gfx gnome-terminal-vim poppler firefox-planet-gnome
overlap          71.671             17.747      13.133         96.790
no-overlap       21.794             16.624       3.693         86.268

d. For i965 Mesa 7.10.2:
           firefox-talos-gfx gnome-terminal-vim poppler firefox-planet-gnome
overlap          62.069             10.806      13.451           -
no-overlap       42.275              7.895       4.546           -

Judging from the results above, it seems that one of the main benefits of
avoiding the "via mask" path is reducing the cost of GLX/EGL context switches
(the ones that happen because we have to change the target surface to draw on
the mask). The GLX_MESA_multithread_makecurrent extension helps a lot with
this (as expected), as can be seen in samples (a) and (b). Still, avoiding the
"via mask" path for overlaps (no-overlap), even when taking advantage of this
extension, offers significant improvements in some cases (e.g. r600g
firefox-talos-gfx, poppler) and smaller but still nice improvements in the
rest.
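
To put rough numbers on "significant" versus "smaller", the ratios can be
computed directly from the tables. The sketch below is only arithmetic on the
figures posted above; it assumes they are elapsed times where lower is better
(the tables do not state units), and the dictionary layout and the speedup()
helper are made up for illustration.

```python
# (overlap, no-overlap) pairs copied from tables (a)-(d);
# None marks the run that has no result.
RESULTS = {
    "a": {"firefox-talos-gfx": (54.073, 17.489),
          "gnome-terminal-vim": (15.656, 13.480),
          "poppler": (9.421, 3.051),
          "firefox-planet-gnome": (83.339, 74.192)},
    "b": {"firefox-talos-gfx": (37.167, 36.439),
          "gnome-terminal-vim": (7.556, 7.064),
          "poppler": (7.045, 3.253),
          "firefox-planet-gnome": (44.158, 42.432)},
    "c": {"firefox-talos-gfx": (71.671, 21.794),
          "gnome-terminal-vim": (17.747, 16.624),
          "poppler": (13.133, 3.693),
          "firefox-planet-gnome": (96.790, 86.268)},
    "d": {"firefox-talos-gfx": (62.069, 42.275),
          "gnome-terminal-vim": (10.806, 7.895),
          "poppler": (13.451, 4.546),
          "firefox-planet-gnome": None},
}


def speedup(pair):
    """Return overlap / no-overlap, or None when the run is missing."""
    if pair is None:
        return None
    overlap, no_overlap = pair
    return overlap / no_overlap


SPEEDUPS = {sample: {trace: speedup(pair) for trace, pair in traces.items()}
            for sample, traces in RESULTS.items()}

if __name__ == "__main__":
    for sample, traces in SPEEDUPS.items():
        for trace, ratio in traces.items():
            shown = "n/a" if ratio is None else f"{ratio:.2f}x"
            print(f"({sample}) {trace:22s} {shown}")
```

Under that assumption, a ratio above 1 means the no-overlap run was faster by
that factor.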

For EGL/GLES2 and GLX implementations that don't currently have an extension
similar to GLX_MESA_multithread_makecurrent, avoiding the "via mask" path is
even more important, as can be seen in samples (c) and (d). All in all, I think
it is worthwhile to investigate how to minimize the usage of the "via mask" path.

The important question here is how we can actually use the "via mask" path
less often. Can we remove the overlap factor completely? Assuming that not using
a mask is wrong, how wrong are the results going to be? If the visual
difference is small enough perhaps we can make this compromise to increase
performance (or use an environment variable and leave it to the user to force
the fast behavior).

If we need to keep the overlap test, we have to solve issue (1) and especially
(2).  Although I can at least imagine a way to tackle (1), I have no idea how
to solve (2) efficiently without extra information given to the backend by,
e.g., pango.

I am looking forward to comments and to corrections of any misconceptions
I have!

Thanks,
Alexandros