[cairo] After 1.3.2, what's going to make cairo faster next?

Thu Nov 16 12:17:22 PST 2006

On Wed, 15 Nov 2006 15:28:54 -0800, Carl Worth wrote:
> We do have several performance problems identified and patches in
> progress for several issues that still aren't in this snapshot, so
> there's definitely some more exciting stuff to come, (more about that
> in my follow ups to the cairo list). But please test this snapshot out
> and let us know if it helps performance for your applications or not.

Hopefully people will come back soon and let us know how the new
snapshot works out.

In the meantime, though, we've got plenty of stuff to keep us busy,
especially with our plan to have new snapshots out every week. We want
to keep that up as best we can until cairo 1.4, which we've now
scheduled for "early January 2007" in order to meet the Gnome 2.18
time table, and should also be soon enough for Firefox 3.0.

Here's a list of some of the performance improvements we've got
planned along with some "help wanted" postings for anyone that wants
to help out.

Oh, and if your favorite cairo performance annoyance isn't mentioned
below, then please, please let us know about it, and ideally supply as
detailed a test case as possible, (though we are getting better at
using sneaky tricks to extract test cases from full applications).
We'll be using the cairo performance test suite to guide our efforts
at improving cairo, so if your pet needs are represented there, then
we won't be helping you.

OK, so first, some stuff for which we already have some code in hand:

* New tessellator

  *Help wanted*: Need a real-world testcase to show a problem

  I ended up not throwing the new tessellator into the 1.3.2
  snapshot. I had gotten it to be 1.5x faster on world_map, but I also
  got the same 1.5x speedup on the old tessellator with the same
  custom quadrilateral tessellation fix.

  At this point, I'll reinvestigate to see if the new tessellator is
  not any slower on any current test cases. If not, then it's just a
  matter of cleaning up the 64-bit division and 24.8 patches and
  letting it land, (also perhaps putting floating-point intersection
  back when that's good enough and a performance win).

  Meanwhile, we know that the new tessellator is a performance win on
  our random test case with the inordinate number of intersections,
  but it would really help to have a real-world test case showing
  scalability problems in the old algorithm. Does anybody thing they
  have one? I think we're probably looking for as many intersections
  as possible in a non-fabricated example.

  Meanwhile, Joonas has been going over the new tessellator with a
  fine-tooth combed and an eye for detail. He's been accounting for
  every bit in the arithmetic, and optimizing the division operations,
  and the random-number generation for the skip lists. I'm hoping to
  see some interesting stuff come out of that work.

* David Turner's linear gradient optimization

  *Help wanted*: Perceptual Diff fix for test suite?

  I've tested this patch and it definitely provides a 2x speedup for
  linear gradients as exercised in the cairo performance suite. One
  trick with the code is that it touches pixman, which means that the
  same patch really needs to get pushed out to fb in the X server so
  that the Render fallbacks are also twice as fast for linear
  gradients.

  I looked into doing that and found that pixman and xserver/fb had
  diverged too much to make it easy. So I spent a fair amount of time
  merging those two trees up, (finding optimizations on cairo's side
  and bug fixes on the X server side), which is work I need to get
  pushed out now.

  Another issue is that the gradient patch changes the results of the
  test suite output (in acceptable ways) so the reference images all
  need to be adjusted, (and the X server will need to be updated
  before it matches). Something that might help reduce a lot of pain
  here is the perceptual diff stuff that was recently proposed. Is
  anyone looking into that?

* David Turner's other patches (spline decomposition storage, etc.)

  *Help wanted*: Need motivation performance test case

  I looked at these patches just a little bit, but wasn't able to
  measure any large performance impact. The problem there might be
  that the current tests in the performance suite just aren't adequate.
  For example, all of the tests with really long paths, (tessellate,
  zrusin_another, and world_map), are dominated by straight line
  segments rather than curves. If we can find a case where these
  patches make some measurable improvement, we can easily land them.

* Søren's optimization for CompositeGeneral getting hit by poppler
  clipping

  *Help wanted*: Replicating/investigating weird thrashing behavior

  This is the patch that inspired Jeff Muizelaar's magic test-case-
  extracting wrapper for cairo. When I tried his very simple clipping
  example in the cairo performance suite it sent my machine into a fit
  of thrashing from which it was hard to recover. The test looks
  really innocent so there's clearly a bad bug here somewhere,
  (perhaps unrelated to the original problem being investigated). It
  might even have a connection with the X server trapezoid mask sizing
  bug mentioned below.

* Identity-matrix short-circuit and other floating-point magic

  *Help wanted*: Daniel, coming back to cairo land soon?

  Aivars Kalvans submitted a nice fix that short circuits unnecessary
  floating-point operations for identity-matrix transformations.
  Daniel Amelang, (who seems to have take on a new, much-needed role
  of cleaning up good, unfinished work), has said he's got a
  cleaned-up version of that patch along with some other
  floating-point magic to help out with the text rendering performance
  problems that Jorn Baayen profiled in pango-cairo. I'm looking
  forward to hearing more from Dan when he recovers from recent midterm
  exams.

Then there are other performance issues that have been identified, but
for which there hasn't been any new code written yet. My plan is to
get performance tests in place for these as quickly as possible and
then use profiling to prioritize which should be worked on first:

* Stroke vs. fill for rectilinear figures

  *Help wanted*: Need the trivial performance test case

  We've mentioned before that some GTK+ theme authors found they could
  get a big performance improvement by using cairo_fill() instead of
  cairo_stroke() to draw things like a single-pixel wide outline of a
  rectangle. The performance tests to write are obvious (draw the same
  figure once with stroke() and once with fill()).

  The fast path here triggers when all trapezoid edges are
  rectilinear and at integer positions on the device-pixel grid. In
  the case of stroke() the exterior edges of the complete shape meet
  those criteria, but the invisible interior edges shared by multiple
  trapezoids are ending up at half-integer positions, (this is
  happening at the 90-degree joins on the path).

  My initial idea for getting stroke to hit the same fast path was to
  outline the entire stroke and send the whole thing through the
  tessellator. This would have fixed this bug, but an experiment with
  world_map this week suggests its a loss in general, (replacing our
  current O(n) incremental trapezoidization for stroke with the
  O(nlogn) tessellation).

  An alternate fix would be to just special-case the handling of a
  path consisting of just horizontal and vertical elements, miter
  joins, an integer line width, and coordinates putting the exterior
  points on integer positions. given all that, generating the
  rectangular trapezoids that compose the path should be quite simple.

* X server generates too-large mask for trapezoids

  *Help wanted*: Need the trivial performance test case

  There's been a long-standing bug in the X server, (one of those
  things we fixed in pixman a long time ago), where when handling
  XRenderCompositeTrapezoids it constructs a mask large enough to fit
  all the trapezoids without consideration for the fact that the
  destination surface (or its current clip region) might be much
  smaller.

  This leads to more allocation than necessary, but worse it causes a
  bunch of rasterization work filling in pixels that will never be
  examined.

  One fix is to get the simple fix from pixman to the X server, and
  then wait for the X server to get updated everywhere. Another fix
  that should also be done is to switch cairo to use XRenderAddTraps
  where it will then be responsible for allocating the mask explicitly
  rather than trusting the X server to do it.

* Generate fewer trapezoids

  *Help wanted*: Need the star + rectangle performance test case

  Zack Rusin mentioned something to me after his blog posting about
  cairo's tessellator (and Qt's old tessellator) generating many
  unnecessary trapezoids. Zack says fixing this was a major part of
  his Qt speedup work.

  Joonas recently measured that in the tessellate-256 case we have 256
  edges that intersect 7332 times but that we generate 765958 (!)
  trapezoids. So there's a big problem there, (which should exist in
  both old and new tessellators).

  Here's a test case that should expose a large part of the problem, I
  think (thanks to Kevin Martin for help in designing this):

  First, draw a many-pointed star on the left, then a rectangle of the
  same size on the right. This is one test case.

  Second, draw the same star and rectangle, but as one path with
  two sub-paths. This is the second test case.

  The performance difference between these two should expose much of
  the problem. The bug is that when the tessellator is processing edge
  events, (start, stop, and intersect), it is generating trapezoids
  for every active edge rather than just those trapezoids that are
  directly caused by the event of interest.

  So the rectangle is being sliced up into many thing trapezoids due
  to unrelated edge events occurring in the star, (and in the star
  itself, many more trapezoids are being generated than necessary as
  each point is being sliced up due to edge events of neighboring
  points).

  There should be a big win possible here.

* Generate better trapezoids

  The test case for this one is already present---just fill a circle.

  While the previous bug can be fixed by generating fewer trapezoids,
  here's a case where we can improve performance by generating more
  trapezoids, but being smarter about it. (Note: Some systems might
  be able to rasterize trapezoids fast enough that only the total
  number matters and this idea could slow things down. In practice, we
  don't have hardware-accelerated trapezoid rasterization happening in
  cairo, yet, so that's not the case).

  So, given a large circle, consider what cairo's tessellator is
  currently generating. It first represents the circle with straight
  line segments to within 0.1 device pixels. It then generates
  trapezoids based on these edges, so they will be really thin, but
  very wide, (from the left side to the right side).

  Given a large enough circle, most pixels affected by the operation
  will be fully covered, but we are not ever generating any trapezoids
  that cover a full pixel. Instead the fully-covered pixels are only
  getting filled in completely by a little contribution from many
  different thin trapezoids.

  Given a software rasterizer, the performance would be much better if
  we generated small trapezoids only within the pixels at the edge of
  our circle. Then, for the fully-covered interior of our circle, we
  could generate one trapezoid per scanline which could be rasterized
  with nothing more than memset.

  This optimization could be just as easily applied to either the old
  or the new tessellator.

* Improve software rasterization

  The software rasterizer that we have in cairo now, (and in the X
  server), is really slow, and the results aren't even all that
  good. It accepts coordinates with 16 bits of sub-pixel precision and
  then uses a regular (yuck!) 17x15 grid of sample points and counts
  the number of points within each pixel by stepping each trapezoid
  edge to each of these 15 rows within each pixel.

  Here's a case where we can do something faster and also get a much
  better result. First, notice that there is no advantage to having 16
  bits of sub-pixel precision. We (currently) only have 8-bits of
  alpha to store the rasterization result, and the artefacts of the
  regular grid reduce our effective resolution even more.

  We've already talked and decided that 8 bits of sub-pixel precision
  really "ought to be enough for anybody".

  One idea I had is to even drop down to 4 bits of sub-pixel
  precision. At that point, there are few enough relative entry and
  exit points for an edge and a pixel that the rasterization problem
  can be reduced to a small (< 2000 entry) table lookup. And a
  table-lookup approach would allow us to use a lovely, stochastic
  sampling pattern instead of regular grid, (and since we're
  pre-computing the table we can afford to do a nice Poisson Disc
  pattern). The improvements from the better grid might even be enough
  to outweigh the reduction in quality from only using 4 bits as
  opposed to 8. Or maybe we decide to go with a bigger table (~ 30000
  entries) and get up to 6 bits. Or we make it configurable.

Anyway, that's some of the stuff we're currently looking into. As
always, the more help we get, the quicker it all goes.

And if I haven't said it enough lately, I really want to express my
appreciation to everybody who uses, tests, plays with, improves, and
has fun with cairo. You all make this whole experience a lot of fun
for me, and all of you should feel very much a part of what cairo
is. It wouldn't be what it is without you. So thanks!

Have fun,

-Carl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/cairo/attachments/20061116/fda9f45b/attachment.pgp