[cairo] After 1.3.2, what's going to make cairo faster next?
cworth at cworth.org
Thu Nov 16 12:17:22 PST 2006
On Wed, 15 Nov 2006 15:28:54 -0800, Carl Worth wrote:
> We do have several performance problems identified and patches in
> progress for several issues that still aren't in this snapshot, so
> there's definitely some more exciting stuff to come, (more about that
> in my follow ups to the cairo list). But please test this snapshot out
> and let us know if it helps performance for your applications or not.
Hopefully people will come back soon and let us know how the new
snapshot works out.
In the meantime, though, we've got plenty of stuff to keep us busy,
especially with our plan to have new snapshots out every week. We want
to keep that up as best we can until cairo 1.4, which we've now
scheduled for "early January 2007" in order to meet the Gnome 2.18
time table, and should also be soon enough for Firefox 3.0.
Here's a list of some of the performance improvements we've got
planned along with some "help wanted" postings for anyone that wants
to help out.
Oh, and if your favorite cairo performance annoyance isn't mentioned
below, then please, please let us know about it, and ideally supply as
detailed a test case as possible, (though we are getting better at
using sneaky tricks to extract test cases from full applications).
We'll be using the cairo performance test suite to guide our efforts
at improving cairo, so if your pet needs are represented there, then
we won't be helping you.
OK, so first, some stuff for which we already have some code in hand:
* New tessellator
*Help wanted*: Need a real-world testcase to show a problem
I ended up not throwing the new tessellator into the 1.3.2
snapshot. I had gotten it to be 1.5x faster on world_map, but I also
got the same 1.5x speedup on the old tessellator with the same
custom quadrilateral tessellation fix.
At this point, I'll reinvestigate to see if the new tessellator is
not any slower on any current test cases. If not, then it's just a
matter of cleaning up the 64-bit division and 24.8 patches and
letting it land, (also perhaps putting floating-point intersection
back when that's good enough and a performance win).
Meanwhile, we know that the new tessellator is a performance win on
our random test case with the inordinate number of intersections,
but it would really help to have a real-world test case showing
scalability problems in the old algorithm. Does anybody thing they
have one? I think we're probably looking for as many intersections
as possible in a non-fabricated example.
Meanwhile, Joonas has been going over the new tessellator with a
fine-tooth combed and an eye for detail. He's been accounting for
every bit in the arithmetic, and optimizing the division operations,
and the random-number generation for the skip lists. I'm hoping to
see some interesting stuff come out of that work.
* David Turner's linear gradient optimization
*Help wanted*: Perceptual Diff fix for test suite?
I've tested this patch and it definitely provides a 2x speedup for
linear gradients as exercised in the cairo performance suite. One
trick with the code is that it touches pixman, which means that the
same patch really needs to get pushed out to fb in the X server so
that the Render fallbacks are also twice as fast for linear
I looked into doing that and found that pixman and xserver/fb had
diverged too much to make it easy. So I spent a fair amount of time
merging those two trees up, (finding optimizations on cairo's side
and bug fixes on the X server side), which is work I need to get
pushed out now.
Another issue is that the gradient patch changes the results of the
test suite output (in acceptable ways) so the reference images all
need to be adjusted, (and the X server will need to be updated
before it matches). Something that might help reduce a lot of pain
here is the perceptual diff stuff that was recently proposed. Is
anyone looking into that?
* David Turner's other patches (spline decomposition storage, etc.)
*Help wanted*: Need motivation performance test case
I looked at these patches just a little bit, but wasn't able to
measure any large performance impact. The problem there might be
that the current tests in the performance suite just aren't adequate.
For example, all of the tests with really long paths, (tessellate,
zrusin_another, and world_map), are dominated by straight line
segments rather than curves. If we can find a case where these
patches make some measurable improvement, we can easily land them.
* Søren's optimization for CompositeGeneral getting hit by poppler
*Help wanted*: Replicating/investigating weird thrashing behavior
This is the patch that inspired Jeff Muizelaar's magic test-case-
extracting wrapper for cairo. When I tried his very simple clipping
example in the cairo performance suite it sent my machine into a fit
of thrashing from which it was hard to recover. The test looks
really innocent so there's clearly a bad bug here somewhere,
(perhaps unrelated to the original problem being investigated). It
might even have a connection with the X server trapezoid mask sizing
bug mentioned below.
* Identity-matrix short-circuit and other floating-point magic
*Help wanted*: Daniel, coming back to cairo land soon?
Aivars Kalvans submitted a nice fix that short circuits unnecessary
floating-point operations for identity-matrix transformations.
Daniel Amelang, (who seems to have take on a new, much-needed role
of cleaning up good, unfinished work), has said he's got a
cleaned-up version of that patch along with some other
floating-point magic to help out with the text rendering performance
problems that Jorn Baayen profiled in pango-cairo. I'm looking
forward to hearing more from Dan when he recovers from recent midterm
Then there are other performance issues that have been identified, but
for which there hasn't been any new code written yet. My plan is to
get performance tests in place for these as quickly as possible and
then use profiling to prioritize which should be worked on first:
* Stroke vs. fill for rectilinear figures
*Help wanted*: Need the trivial performance test case
We've mentioned before that some GTK+ theme authors found they could
get a big performance improvement by using cairo_fill() instead of
cairo_stroke() to draw things like a single-pixel wide outline of a
rectangle. The performance tests to write are obvious (draw the same
figure once with stroke() and once with fill()).
The fast path here triggers when all trapezoid edges are
rectilinear and at integer positions on the device-pixel grid. In
the case of stroke() the exterior edges of the complete shape meet
those criteria, but the invisible interior edges shared by multiple
trapezoids are ending up at half-integer positions, (this is
happening at the 90-degree joins on the path).
My initial idea for getting stroke to hit the same fast path was to
outline the entire stroke and send the whole thing through the
tessellator. This would have fixed this bug, but an experiment with
world_map this week suggests its a loss in general, (replacing our
current O(n) incremental trapezoidization for stroke with the
An alternate fix would be to just special-case the handling of a
path consisting of just horizontal and vertical elements, miter
joins, an integer line width, and coordinates putting the exterior
points on integer positions. given all that, generating the
rectangular trapezoids that compose the path should be quite simple.
* X server generates too-large mask for trapezoids
*Help wanted*: Need the trivial performance test case
There's been a long-standing bug in the X server, (one of those
things we fixed in pixman a long time ago), where when handling
XRenderCompositeTrapezoids it constructs a mask large enough to fit
all the trapezoids without consideration for the fact that the
destination surface (or its current clip region) might be much
This leads to more allocation than necessary, but worse it causes a
bunch of rasterization work filling in pixels that will never be
One fix is to get the simple fix from pixman to the X server, and
then wait for the X server to get updated everywhere. Another fix
that should also be done is to switch cairo to use XRenderAddTraps
where it will then be responsible for allocating the mask explicitly
rather than trusting the X server to do it.
* Generate fewer trapezoids
*Help wanted*: Need the star + rectangle performance test case
Zack Rusin mentioned something to me after his blog posting about
cairo's tessellator (and Qt's old tessellator) generating many
unnecessary trapezoids. Zack says fixing this was a major part of
his Qt speedup work.
Joonas recently measured that in the tessellate-256 case we have 256
edges that intersect 7332 times but that we generate 765958 (!)
trapezoids. So there's a big problem there, (which should exist in
both old and new tessellators).
Here's a test case that should expose a large part of the problem, I
think (thanks to Kevin Martin for help in designing this):
First, draw a many-pointed star on the left, then a rectangle of the
same size on the right. This is one test case.
Second, draw the same star and rectangle, but as one path with
two sub-paths. This is the second test case.
The performance difference between these two should expose much of
the problem. The bug is that when the tessellator is processing edge
events, (start, stop, and intersect), it is generating trapezoids
for every active edge rather than just those trapezoids that are
directly caused by the event of interest.
So the rectangle is being sliced up into many thing trapezoids due
to unrelated edge events occurring in the star, (and in the star
itself, many more trapezoids are being generated than necessary as
each point is being sliced up due to edge events of neighboring
There should be a big win possible here.
* Generate better trapezoids
The test case for this one is already present---just fill a circle.
While the previous bug can be fixed by generating fewer trapezoids,
here's a case where we can improve performance by generating more
trapezoids, but being smarter about it. (Note: Some systems might
be able to rasterize trapezoids fast enough that only the total
number matters and this idea could slow things down. In practice, we
don't have hardware-accelerated trapezoid rasterization happening in
cairo, yet, so that's not the case).
So, given a large circle, consider what cairo's tessellator is
currently generating. It first represents the circle with straight
line segments to within 0.1 device pixels. It then generates
trapezoids based on these edges, so they will be really thin, but
very wide, (from the left side to the right side).
Given a large enough circle, most pixels affected by the operation
will be fully covered, but we are not ever generating any trapezoids
that cover a full pixel. Instead the fully-covered pixels are only
getting filled in completely by a little contribution from many
different thin trapezoids.
Given a software rasterizer, the performance would be much better if
we generated small trapezoids only within the pixels at the edge of
our circle. Then, for the fully-covered interior of our circle, we
could generate one trapezoid per scanline which could be rasterized
with nothing more than memset.
This optimization could be just as easily applied to either the old
or the new tessellator.
* Improve software rasterization
The software rasterizer that we have in cairo now, (and in the X
server), is really slow, and the results aren't even all that
good. It accepts coordinates with 16 bits of sub-pixel precision and
then uses a regular (yuck!) 17x15 grid of sample points and counts
the number of points within each pixel by stepping each trapezoid
edge to each of these 15 rows within each pixel.
Here's a case where we can do something faster and also get a much
better result. First, notice that there is no advantage to having 16
bits of sub-pixel precision. We (currently) only have 8-bits of
alpha to store the rasterization result, and the artefacts of the
regular grid reduce our effective resolution even more.
We've already talked and decided that 8 bits of sub-pixel precision
really "ought to be enough for anybody".
One idea I had is to even drop down to 4 bits of sub-pixel
precision. At that point, there are few enough relative entry and
exit points for an edge and a pixel that the rasterization problem
can be reduced to a small (< 2000 entry) table lookup. And a
table-lookup approach would allow us to use a lovely, stochastic
sampling pattern instead of regular grid, (and since we're
pre-computing the table we can afford to do a nice Poisson Disc
pattern). The improvements from the better grid might even be enough
to outweigh the reduction in quality from only using 4 bits as
opposed to 8. Or maybe we decide to go with a bigger table (~ 30000
entries) and get up to 6 bits. Or we make it configurable.
Anyway, that's some of the stuff we're currently looking into. As
always, the more help we get, the quicker it all goes.
And if I haven't said it enough lately, I really want to express my
appreciation to everybody who uses, tests, plays with, improves, and
has fun with cairo. You all make this whole experience a lot of fun
for me, and all of you should feel very much a part of what cairo
is. It wouldn't be what it is without you. So thanks!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Size: 189 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/cairo/attachments/20061116/fda9f45b/attachment.pgp
More information about the cairo