[cairo] New tessellator work and thoughts
M Joonas Pihlaja
jpihlaja at cc.helsinki.fi
Mon Nov 20 18:07:57 PST 2006
I've been working on the new tessellator and have been prodded to
write up this report. The short version is: for simple geometries,
it's now almost as fast as the current version, and it's faster on
highly intersecting geometries. However, there's more work to do.
_The_ main obstacle when doing nifty things in the tessellator, I
think, is that the current tessellator is fast for simple
geometries, and we need to keep it as fast.
* Stuff that's been done
What kicked this off was of course the 128 bit division. This
has been replaced by a 96/64 -> 32 bit quotient, 64 bit remainder
version. The new, faster division needs at most one 64/32 bit
division plus one 64/64 bit division in the worst case, but
unfortunately I couldn't get gcc to compile the 64/32 bit
division down to a single div instruction, so in practice it's
two 64/64 bit divisions.
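As a sketch of the new division's contract (using gcc's __int128 only to
state what the operation computes; the real code avoids 128-bit
arithmetic, and the function name here is made up):

```c
#include <stdint.h>
#include <assert.h>

/* Contract of the new division: a 96-bit dividend (held in a __int128
 * purely for clarity) divided by a 64-bit divisor, yielding a 32-bit
 * quotient and a 64-bit remainder.  The caller must guarantee the
 * quotient actually fits in 32 bits. */
typedef struct {
    uint32_t quo;
    uint64_t rem;
} quorem_t;

static quorem_t
divrem_96_by_64 (unsigned __int128 num, uint64_t den)
{
    quorem_t qr;
    assert (den != 0);
    assert (num / den <= UINT32_MAX);  /* quotient fits in 32 bits */
    qr.quo = (uint32_t) (num / den);
    qr.rem = (uint64_t) (num % den);
    return qr;
}
```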
Next, profiling showed that after the division got faster, the
major bottleneck in the code was either the skip list (for simple
cases with not too many intersections), or the trapezoid
generation as Carl mentioned before (for complex cases).
We were mallocing and freeing a lot (~15%) while inserting
Bentley-Ottmann events and sweep line elements. Freelists of nodes
helped here. The skip list comparators were doing lots of
arithmetic even for common cases, but adding a whole bunch of
fast paths to them helped out with that. We were also doing skip
list lookups right before inserting new intersections to avoid
non-unique intersections in the event queue, but merging the
uniqueness checking into the insertion operation cut the skip
list time in half again.
Additionally, on every insert the skip list would call random()
twice, which turned out to be taking about 15% of the time. I
suspect it was spending most of that time locking and unlocking
the shared mutex that protects the global random state, or
something. In any case,
replacing the random() calls with a shift register based
generator seemed to help with that.
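For illustration, a shift register generator of this flavour can be as
simple as the following; the constants are Marsaglia's classic 32-bit
xorshift triple, not necessarily the ones actually used in the patch:

```c
#include <stdint.h>

/* Xorshift-style shift register generator: three shifts and xors per
 * call, no locking, no library call -- cheap compared to a
 * mutex-protected random(). */
static uint32_t rng_state = 0x12345678;  /* any nonzero seed */

static uint32_t
rng_next (void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}
```

For the skip list's purposes the statistical quality hardly matters; it
only needs roughly fair coin flips to pick node heights.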
Finally, on a marginally higher level, it turned out that the
majority of the skip list operations were due to the initial
sorting of polygon edges into start/stop events. This was being
done by repeated calls to skip_list_insert() to prime the event
queue. In the highly intersecting case this really doesn't
matter as the skip lists are quite large anyway due to the number
of intersections, but in the simple cases it was taking up a lot
of time, simply because the number of start/stop events in the
event skip list was causing overwhelming overhead when inserting
relatively short lived intersection events.
So to reduce the overhead, we're currently presorting the start
and stop events using qsort() up front, and using a skip list
only for dynamic intersection events. Now even though the
initial implementation had almost zero effect on the runtime, it
clearly partitioned the profiles of the time spent doing each
operation, so it was relatively straightforward to optimise the
presorting part to give a measurable difference.
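The presorting idea can be sketched as below; event_t and its fields
are simplifications invented for this example, not the actual cairo
structures:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical start/stop event record.  The point is the strategy:
 * sort all start/stop events by y (then x) once with qsort() instead
 * of feeding them through skip_list_insert() one at a time. */
typedef struct {
    int32_t x, y;   /* fixed-point position of the event */
    int     start;  /* 1 = edge starts here, 0 = edge stops here */
} event_t;

static int
event_compare (const void *a, const void *b)
{
    const event_t *ea = a, *eb = b;
    if (ea->y != eb->y)
        return ea->y < eb->y ? -1 : 1;
    if (ea->x != eb->x)
        return ea->x < eb->x ? -1 : 1;
    return 0;
}

static void
presort_events (event_t *events, size_t n)
{
    qsort (events, n, sizeof (event_t), event_compare);
}
```

The skip list then only has to hold the dynamically discovered
intersection events, which is a much smaller population in the simple
cases.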
Summing up the work so far, the new tessellator is "way faster"
for the highly intersecting cases, and "marginally slower" for
the simple cases, when compared with the current tessellator.
With simple geometries the line intersection is taking about
40-50% of the tessellation time again, which I think is actually
a really good thing, since the new-tessellator-64-bit branch in
Carl's tree is promising to cut the intersection time down by at
least a half. Not that we'll be able to touch the one and only
true graphics ninja any time soon. ;)
* Some numbers
Cairo-perf-diff of the new tessellator as it stands now, vs. the
current master branch:
image-rgb tessellate-256-100 1427.75 0.02% -> 305.24 0.31%: 4.68x speedup
image-rgba tessellate-256-100 1424.60 0.15% -> 305.58 0.28%: 4.66x speedup
image-rgba zrusin_another_tessellate-415 151.18 0.05% -> 39.66 0.10%: 3.81x speedup
image-rgb zrusin_another_tessellate-415 151.35 0.04% -> 39.72 0.21%: 3.81x speedup
image-rgba tessellate-64-100 18.91 0.07% -> 5.79 0.67%: 3.26x speedup
image-rgb tessellate-64-100 18.91 0.21% -> 5.82 0.53%: 3.25x speedup
image-rgba zrusin_another_fill-415 238.60 1.03% -> 128.15 0.11%: 1.86x speedup
image-rgb zrusin_another_fill-415 238.79 0.95% -> 128.46 0.09%: 1.86x speedup
image-rgba world_map-800 771.15 0.03% -> 688.68 0.01%: 1.12x speedup
image-rgb world_map-800 769.12 0.08% -> 687.60 0.03%: 1.12x speedup
image-rgba fill_solid_rgb_over-64 0.39 1.36% -> 0.46 1.24%: 1.17x slowdown
image-rgb fill_solid_rgb_over-64 0.40 1.47% -> 0.46 2.68%: 1.16x slowdown
image-rgba fill_solid_rgba_over-64 0.47 2.05% -> 0.54 2.15%: 1.15x slowdown
image-rgb fill_solid_rgba_over-64 0.47 2.02% -> 0.54 0.68%: 1.14x slowdown
image-rgba fill_image_rgba_over-64 0.50 1.30% -> 0.57 1.12%: 1.14x slowdown
Note that in the world_map case, using the image backend, the
time breaks down into rasterisation, stroke tessellation and fill
tessellation operations as:
new tessellator current master
rasterisation 65 % 60 %
tessellate fill 22 % 29 %
tessellate stroke 13 % 11 %
These were all computed on an age old Athlon TB. YMMV.
* Generate fewer trapezoids
Carl already mentioned this, but I'd like to elaborate on some of
the issues involved, because they form sort of a roadmap of
issues to deal with in the future.
So right now the tessellator is generating a trapezoid for every
line segment in the sweep line, for every event that changes the
current y value of the sweep -- that is, for N line segments that
have O(N^2) intersections at different y's, we're generating
O(N) traps for each intersection by cutting the polygon
horizontally, for a whopping running time of O(N^3).
The main idea for fixing this is of course quite simple: only
generate a few traps per intersection or start/stop event when
you need to. There are at least two ways to go about implementing
this. The first one still does O(N^3) operations by doing the
"horizontal cutting" action, but instead of generating traps for
each active segment in the sweep line, it defers the trap
generation for as long as possible. This isn't all that horrible
of an idea, even if asymptotically it's the same, because a
simple horizontal iteration across the sweep line is really
cheap.
The other approach maintains the winding number of the region to
the right of each segment in the sweep line incrementally. Say
you have a bunch of events that occur at the same y-coordinate.
Then due to the way that the event queue is constructed, you get
to see these in non-decreasing x-order, and typically most
changes are "local" in the sense that the winding numbers of only
a few regions change.
So, to incrementally compute the winding numbers during event
processing for a single y coordinate, you keep a pointer into the
segments on the sweep line that marks the first edge that
corresponds to a winding number that you *don't* know yet. When
you next get an event on the same sweep line, you move the
pointer forward one edge at a time recomputing the winding
numbers as you go, until you find an edge whose winding number
didn't change or you hit the edge that was modified by the event.
Either stopping condition lets you fast forward the pointer to
the current event, and hey presto, we're at O(1) winding number
computation time per intersection.
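A sketch of that bookkeeping, with a deliberately simplified sweep
line (edge_t and the array layout here are hypothetical stand-ins for
the real sweep line structures):

```c
#include <stddef.h>

/* Each sweep line edge carries its direction and a cached winding
 * number for the region to its right. */
typedef struct {
    int dir;      /* +1 for upward edges, -1 for downward edges */
    int winding;  /* cached winding number of the region to the right */
} edge_t;

/* Recompute cached winding numbers from the first unknown edge
 * (*valid) forward until we reach the edge touched by the current
 * event (target), or meet a cached value that didn't change -- at
 * which point the remaining caches are still good.  Each edge is
 * scanned at most once per sweep position, so the amortized cost per
 * event is O(1). */
static void
advance_winding (edge_t *edges, size_t *valid, size_t target)
{
    size_t i = *valid;
    int w = (i == 0) ? 0 : edges[i-1].winding;

    for (; i <= target; i++) {
        w += edges[i].dir;
        if (edges[i].winding == w)
            break;              /* unchanged: caches up to target hold */
        edges[i].winding = w;
    }
    *valid = target + 1;        /* fast forward past the current event */
}
```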
Btw, note that even though you might end up scanning O(N) edges
with the pointer above, you're still only doing O(1) work per
intersection. The only reason you might need to scan more than
one or two edges forward before you can fast forward your pointer
is that you have _horizontal_ intersections in the geometry --
which aren't otherwise noted in the Bentley-Ottmann algorithm.
Extending the incremental winding number computation into a
deferred trapezoid generation implementation is left as an
exercise for the reader. ;)
* All is not so rosy in integer land.
OK, so actually trying to implement the incremental trapezoid
generation algorithm above, I have learned to hate and fear
degeneracies. Also, zrusin_another_tessellate.
The main issue is precision of the intersection points -- we're
not doing Hobby's tolerance squares yet, instead snapping
intersection points to grid points as soon as possible. Btw,
when I say "grid points" here I mean the grid that includes the
two guard bits in the new tessellator. But they might as well
not exist as far as this discussion is concerned.
The problem is that we really ought to be able to order the
segments on the sweep line coherently throughout their life span,
but we're currently getting confused by nearly coincident edges
and shared intersection/start/stop points. On the up side, none of
that matters currently thanks to a) 16.16 representation making
those types of degeneracies "rare", and more importantly, b) the
excess trapezoidization hides the difficult bits and makes the
errors invisible. But when trap generation is deferred to the
last possible moment, we're going to need to start caring about
sweep line ordering for those cases again.
Putting the tolerance square algorithm aside for the moment, we'd
need about five times the number of bits of the input coordinates
to accurately compute intersection coordinates "losslessly",
i.e. lossless in the sense that comparisons between the
approximate intersections respect the comparison order of the
exact intersections.
(Fast and loose: If we have an N bit integer, it takes a fixed
point number format 1.2N just to be able to losslessly take the
inverse 1/x of x. Since when intersecting segments with N bit
coordinates, you end up computing the division of a 3N bit
dividend and a 2N bit divisor, we would need a 3N.4N fixed point
format to represent all possible line intersections. This is 7N
bits in total, but we're only interested in intersections with
the top 2N bits zero, so it's only 1N.4N for us.)
But that's for fixed point formats that have to take worst cases
into account. For floating point formats we only need about 3N
bits of mantissa to compute intersections with N bit coordinates.
(I've been saying "about mN bits" because we need an extra bit
here and there to account for overflow when doing things like
edge vector offsets and such.)
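To make the operand sizes concrete, here is a sketch of the
x-coordinate computation for a segment intersection; with N-bit inputs
the coordinate differences are N+1 bits, the denominator about 2N+2
bits and the dividend about 3N bits, so int64_t comfortably covers
16-bit inputs (the struct and names are invented for this example):

```c
#include <stdint.h>

/* A line segment with N-bit endpoint coordinates. */
typedef struct { int64_t x1, y1, x2, y2; } seg_t;

/* Computes the intersection x of the supporting lines of a and b as
 * the rational num/den.  Assumes the lines are not parallel
 * (den != 0). */
static void
intersect_x (seg_t a, seg_t b, int64_t *num, int64_t *den)
{
    int64_t dax = a.x2 - a.x1, day = a.y2 - a.y1;   /* N+1 bits   */
    int64_t dbx = b.x2 - b.x1, dby = b.y2 - b.y1;
    *den = dax * dby - day * dbx;                   /* ~2N+2 bits */
    /* parameter t = tn / den along segment a */
    int64_t tn = (b.x1 - a.x1) * dby - (b.y1 - a.y1) * dbx;
    /* x = a.x1 + t*dax = (a.x1*den + tn*dax) / den: ~3N-bit dividend */
    *num = a.x1 * *den + tn * dax;
}
```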
* Exact arithmetic -- not only for weenies!
Do we really care, and if so what can we do about it? Well, I
think it's not going to be a huge problem even if we completely
disregard the snapping issues. It's likely possible to do a
careful implementation of the deferred trapezoidization that
doesn't defer trap generation to the last minute, but only as far
as it is safe to without segment ordering issues.
Another possibility would be to look out for and fix orderings
gone wrong in the sweep line as we incrementally compute winding
numbers. While a straightforward implementation of that takes us
back into O(N^2) land again, it's likely not going to happen
often, so this is the fix I'm going to attempt first.
Having said that, the move to 24.8 fixed point does have the
potential to make the robustness issues more visible at the user
level. So we should keep in mind the possibility of computing
with exact arithmetic by keeping intersection coordinates in the
<integer> + <remainder>/<denominator> form.
The denominators and remainders will be ~64 bits, and comparisons
will be about as fast as now for the majority of points, where we
can differentiate by the integer portion only. It's only for
rare multiple intersections that we need to do any extra work at
all.
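A comparison on that representation might look like the following
sketch (the struct and names are made up; note that the
cross-multiplication needs a widened product):

```c
#include <stdint.h>

/* An exact x-coordinate kept as <integer> + <remainder>/<denominator>,
 * normalized so that 0 <= rem < den. */
typedef struct {
    int64_t  i;    /* integer (grid) part */
    uint64_t rem;  /* 0 <= rem < den */
    uint64_t den;  /* > 0 */
} exact_x_t;

/* Most comparisons are decided by the integer part alone; only ties
 * fall back to cross-multiplying the fractional parts, widened to 128
 * bits since a 64x64 product can overflow. */
static int
exact_x_compare (const exact_x_t *a, const exact_x_t *b)
{
    if (a->i != b->i)
        return a->i < b->i ? -1 : 1;
    unsigned __int128 lhs = (unsigned __int128) a->rem * b->den;
    unsigned __int128 rhs = (unsigned __int128) b->rem * a->den;
    if (lhs != rhs)
        return lhs < rhs ? -1 : 1;
    return 0;
}
```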
About 40-50% of the time spent in the wideint division code is
taken up, according to callgrind, by the relatively mundane
problem of dealing with correct signs of the results. Now, not
only is that kind of a silly thing to be wasting time on, it's
also doing the wrong thing, because the signed division is made
to simulate the common 2's complement symmetric division. This
leads to negative remainders sometimes, and is the wrong thing
when snapping to the grid (we could fix the snapping to account
for this, but it doesn't seem worth it). So Carl was thinking
that we could move to assuming non-negative inputs in the
tessellator by offsetting coordinates at the start.
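To illustrate why truncated division is the wrong thing for snapping
(assuming a positive divisor):

```c
#include <stdint.h>

/* C's signed division truncates toward zero, so -7/4 = -1 with
 * remainder -3, which snaps -7/4 the wrong way.  Grid snapping wants
 * floor division with a non-negative remainder; offsetting all input
 * coordinates to be non-negative up front makes the two agree. */
static int64_t
floor_div (int64_t a, int64_t b)   /* b > 0 assumed */
{
    int64_t q = a / b;
    if ((a % b) != 0 && a < 0)
        q--;                       /* round toward -infinity instead */
    return q;
}
```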
Another thing is that the skip list does seem like it's got a lot
of overhead for small lists. It would be useful to look into a
few different data structures here and see how they do.
Breaking through n log(n) in common cases: For simple geometries,
we're dominated algorithmically by the presorting of the
start/stop events. We should look into using a radix-like sort
based on the range of the inputs. Or even simply splitting the
y-range into equidistant buckets and qsorting those.
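The bucket-plus-qsort idea might look like this sketch on plain int
keys (the key type and bucket count are illustrative only):

```c
#include <stdlib.h>
#include <string.h>

static int
int_compare (const void *a, const void *b)
{
    int ia = *(const int *) a, ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

/* Distribute keys into equal-width buckets over [min, max], qsort each
 * bucket, and concatenate.  For inputs spread evenly over their range
 * the per-bucket sorts stay small, so the total cost approaches O(n). */
static void
bucket_sort (int *keys, size_t n, int min, int max, size_t nbuckets)
{
    size_t *counts = calloc (nbuckets + 1, sizeof (size_t));
    int *out = malloc (n * sizeof (int));
    double width = (double) (max - min + 1) / nbuckets;
    size_t i, b;

    /* count, prefix-sum, scatter: a counting pass per bucket */
    for (i = 0; i < n; i++)
        counts[(size_t) ((keys[i] - min) / width) + 1]++;
    for (b = 1; b <= nbuckets; b++)
        counts[b] += counts[b-1];
    for (i = 0; i < n; i++) {
        b = (size_t) ((keys[i] - min) / width);
        out[counts[b]++] = keys[i];
    }
    /* counts[b] now marks the end of bucket b; sort each bucket */
    for (b = 0; b < nbuckets; b++) {
        size_t start = b ? counts[b-1] : 0;
        qsort (out + start, counts[b] - start, sizeof (int), int_compare);
    }
    memcpy (keys, out, n * sizeof (int));
    free (out);
    free (counts);
}
```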
Other stuff to do is implementing Hobby's tolerance squares and
incorporating the faster 64 bit arithmetic, but those are, I
think, fairly low priority until we get the trapezoid generation
sorted out.