[cairo] New tessellator work and thoughts
M Joonas Pihlaja
jpihlaja at cc.helsinki.fi
Mon Nov 20 18:07:57 PST 2006
I've been working on the new tessellator and have been prodded to
write up this report. The short version is: for simple geometries,
it's now almost as fast as the current version, and it's faster on
highly intersecting geometries. However, there's more work to do.
_The_ main obstacle when doing nifty things in the tessellator, I
think, is that the current tessellator is fast for simple
geometries, and we need to keep it as fast.
* Stuff that's been done
What kicked this off was of course the 128 bit division. This
has been replaced by a 96/64 -> 32 bit quotient, 64 bit remainder
version. The new, faster division needs at most one 64/32 bit
division plus one 64/64 bit division in the worst case, but
unfortunately I couldn't get gcc to compile the 64/32 bit
division down to a single div instruction, so in practice it's
two 64/64 bit divisions.
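As a sketch of the new division's contract (using gcc's __int128 only to
state what the operation computes; the real code avoids 128-bit
arithmetic, and the function name here is made up):

```c
#include <stdint.h>
#include <assert.h>

/* Contract of the new division: a 96-bit dividend (held in a __int128
 * purely for clarity) divided by a 64-bit divisor, yielding a 32-bit
 * quotient and a 64-bit remainder.  The caller must guarantee the
 * quotient actually fits in 32 bits. */
typedef struct {
    uint32_t quo;
    uint64_t rem;
} quorem_t;

static quorem_t
divrem_96_by_64 (unsigned __int128 num, uint64_t den)
{
    quorem_t qr;
    assert (den != 0);
    assert (num / den <= UINT32_MAX);  /* quotient fits in 32 bits */
    qr.quo = (uint32_t) (num / den);
    qr.rem = (uint64_t) (num % den);
    return qr;
}
```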
Next, profiling showed that after the division got faster, the
major bottleneck in the code was either the skip list (for simple
cases with not too many intersections), or the trapezoid
generation as Carl mentioned before (for complex cases).
We were mallocing and freeing a lot (~15%) while inserting
Bentley-Ottmann events and sweep line elements. Freelists of nodes
helped here. The skip list comparators were doing lots of
arithmetic even for common cases, but adding a whole bunch of
fast paths to them helped out with that. We were also doing skip
list lookups right before inserting new intersections to avoid
non-unique intersections in the event queue, but merging the
uniqueness checking into the insertion operation cut the skip
list time in half again.
Additionally, on every insert the skip list would call random()
twice, which turned out to be taking about 15% of the time. I
suspect it was spending most of that time locking and unlocking
the shared mutex that protects the global random state, or
something. In any case,
replacing the random() calls with a shift register based
generator seemed to help with that.
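For illustration, a shift register generator of this flavour can be as
simple as the following; the constants are Marsaglia's classic 32-bit
xorshift triple, not necessarily the ones actually used in the patch:

```c
#include <stdint.h>

/* Xorshift-style shift register generator: three shifts and xors per
 * call, no locking, no library call -- cheap compared to a
 * mutex-protected random(). */
static uint32_t rng_state = 0x12345678;  /* any nonzero seed */

static uint32_t
rng_next (void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}
```

For the skip list's purposes the statistical quality hardly matters; it
only needs roughly fair coin flips to pick node heights.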
Finally, on a marginally higher level, it turned out that the
majority of the skip list operations were due to the initial
sorting of polygon edges into start/stop events. This was being
done by repeated calls to skip_list_insert() to prime the event
queue. In the highly intersecting case this really doesn't
matter as the skip lists are quite large anyway due to the number
of intersections, but in the simple cases it was taking up a lot
of time, simply because the number of start/stop events in the
event skip list was causing overwhelming overhead when inserting
relatively short lived intersection events.
So to reduce the overhead, we're currently presorting the start
and stop events using qsort() up front, and using a skip list
only for dynamic intersection events. Now even though the
initial implementation had almost zero effect on the runtime, it
clearly partitioned the profiles of the time spent doing each
operation, so it was relatively straightforward to optimise the
presorting part to give a measurable difference.
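The presorting idea can be sketched as below; event_t and its fields
are simplifications invented for this example, not the actual cairo
structures:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical start/stop event record.  The point is the strategy:
 * sort all start/stop events by y (then x) once with qsort() instead
 * of feeding them through skip_list_insert() one at a time. */
typedef struct {
    int32_t x, y;   /* fixed-point position of the event */
    int     start;  /* 1 = edge starts here, 0 = edge stops here */
} event_t;

static int
event_compare (const void *a, const void *b)
{
    const event_t *ea = a, *eb = b;
    if (ea->y != eb->y)
        return ea->y < eb->y ? -1 : 1;
    if (ea->x != eb->x)
        return ea->x < eb->x ? -1 : 1;
    return 0;
}

static void
presort_events (event_t *events, size_t n)
{
    qsort (events, n, sizeof (event_t), event_compare);
}
```

The skip list then only has to hold the dynamically discovered
intersection events, which is a much smaller population in the simple
cases.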
Summing up the work so far, the new tessellator is "way faster"
for the highly intersecting cases, and "marginally slower" for
the simple cases, when compared with the current tessellator.
With simple geometries the line intersection is taking about
40-50% of the tessellation time again, which I think is actually
a really good thing, since the new-tessellator-64-bit branch in
Carl's tree is promising to cut the intersection time down by at
least a half. Not that we'll be able to touch the one and only
true graphics ninja any time soon. ;)
* Some numbers
Cairo-perf-diff of the new tessellator as it stands now, vs. the
current master branch:
image-rgb tessellate-256-100 1427.75 0.02% -> 305.24 0.31%: 4.68x speedup
image-rgba tessellate-256-100 1424.60 0.15% -> 305.58 0.28%: 4.66x speedup
image-rgba zrusin_another_tessellate-415 151.18 0.05% -> 39.66 0.10%: 3.81x speedup
image-rgb zrusin_another_tessellate-415 151.35 0.04% -> 39.72 0.21%: 3.81x speedup
image-rgba tessellate-64-100 18.91 0.07% -> 5.79 0.67%: 3.26x speedup
image-rgb tessellate-64-100 18.91 0.21% -> 5.82 0.53%: 3.25x speedup
image-rgba zrusin_another_fill-415 238.60 1.03% -> 128.15 0.11%: 1.86x speedup
image-rgb zrusin_another_fill-415 238.79 0.95% -> 128.46 0.09%: 1.86x speedup
image-rgba world_map-800 771.15 0.03% -> 688.68 0.01%: 1.12x speedup
image-rgb world_map-800 769.12 0.08% -> 687.60 0.03%: 1.12x speedup
image-rgba fill_solid_rgb_over-64 0.39 1.36% -> 0.46 1.24%: 1.17x slowdown
image-rgb fill_solid_rgb_over-64 0.40 1.47% -> 0.46 2.68%: 1.16x slowdown
image-rgba fill_solid_rgba_over-64 0.47 2.05% -> 0.54 2.15%: 1.15x slowdown
image-rgb fill_solid_rgba_over-64 0.47 2.02% -> 0.54 0.68%: 1.14x slowdown
image-rgba fill_image_rgba_over-64 0.50 1.30% -> 0.57 1.12%: 1.14x slowdown
Note that in the world_map case, using the image backend, the
time breaks down into rasterisation, stroke tessellation and fill
tessellation operations as:
new tessellator current master
rasterisation 65 % 60 %
tessellate fill 22 % 29 %
tessellate stroke 13 % 11 %
These were all computed on an age old Athlon TB. YMMV.
* Generate fewer trapezoids
Carl already mentioned this, but I'd like to elaborate on some of
the issues involved, because they form sort of a roadmap of
issues to deal with in the future.
So right now the tessellator is generating a trapezoid for every
line segment in the sweep line, for every event that changes the
current y value of the sweep -- that is, for N line segments that
have O(N^2) intersections at different y's, we're generating
O(N) traps for each intersection by cutting the polygon
horizontally, for a whopping running time of O(N^3).
The main idea for fixing this is of course quite simple: only
generate a few traps per intersection or start/stop event when
you need to. There are at least two ways to go about implementing
this. The first one still does O(N^3) operations by doing the
"horizontal cutting" action, but instead of generating traps for
each active segment in the sweep line, it defers the trap
generation for as long as possible. This isn't all that horrible
of an idea, even if asymptotically it's the same, because a
simple horizontal iteration across the sweep line is really
cheap.
The other approach maintains the winding number of the region to
the right of each segment in the sweep line incrementally. Say
you have a bunch of events that occur at the same y-coordinate.
Then due to the way that the event queue is constructed, you get
to see these in non-decreasing x-order, and typically most
changes are "local" in the sense that the winding numbers of only
a few regions change.
So, to incrementally compute the winding numbers during event
processing for a single y coordinate, you keep a pointer into the
segments on the sweep line that marks the first edge that
corresponds to a winding number that you *don't* know yet. When
you next get an event on the same sweep line, you move the
pointer forward one edge at a time recomputing the winding
numbers as you go, until you find an edge whose winding number
didn't change or you hit the edge that was modified by the event.
Either stopping condition lets you fast forward the pointer to
the current event, and hey presto, we're at O(1) winding number
computation time per intersection.
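A sketch of that bookkeeping, with a deliberately simplified sweep
line (edge_t and the array layout here are hypothetical stand-ins for
the real sweep line structures):

```c
#include <stddef.h>

/* Each sweep line edge carries its direction and a cached winding
 * number for the region to its right. */
typedef struct {
    int dir;      /* +1 for upward edges, -1 for downward edges */
    int winding;  /* cached winding number of the region to the right */
} edge_t;

/* Recompute cached winding numbers from the first unknown edge
 * (*valid) forward until we reach the edge touched by the current
 * event (target), or meet a cached value that didn't change -- at
 * which point the remaining caches are still good.  Each edge is
 * scanned at most once per sweep position, so the amortized cost per
 * event is O(1). */
static void
advance_winding (edge_t *edges, size_t *valid, size_t target)
{
    size_t i = *valid;
    int w = (i == 0) ? 0 : edges[i-1].winding;

    for (; i <= target; i++) {
        w += edges[i].dir;
        if (edges[i].winding == w)
            break;              /* unchanged: caches up to target hold */
        edges[i].winding = w;
    }
    *valid = target + 1;        /* fast forward past the current event */
}
```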
Btw, note that even though you might end up scanning O(N) edges
with the pointer above, you're still only doing O(1) work per
intersection. The only reason you might need to scan more than
one or two edges forward before you can fast forward your pointer
is that you have _horizontal_ intersections in the geometry --
which aren't otherwise noted in the Bentley-Ottmann algorithm.
Extending the incremental winding number computation into a
deferred trapezoid generation implementation is left as an
exercise for the reader. ;)
* All is not so rosy in integer land.
OK, so actually trying to implement the incremental trapezoid
generation algorithm above, I have learned to hate and fear
degeneracies. Also, zrusin_another_tessellate.
The main issue is precision of the intersection points -- we're
not doing Hobby's tolerance squares yet, instead snapping
intersection points to grid points as soon as possible. Btw,
when I say "grid points" here I mean the grid that includes the
two guard bits in the new tessellator. But they might as well
not exist as far as this discussion is concerned.
The problem is that we really ought to be able to order the
segments on the sweep line coherently throughout their life span,
but we're currently getting confused by nearly coincident edges
and shared intersection/start/stop points. On the up side, none of
that matters currently thanks to a) 16.16 representation making
those types of degeneracies "rare", and more importantly, b) the
excess trapezoidization hides the difficult bits and makes the
errors invisible. But when trap generation is deferred to the
last possible moment, we're going to need to start caring about
sweep line ordering for those cases again.
Putting the tolerance square algorithm aside for the moment, we'd
need about five times the number of bits of the input coordinates
to accurately compute intersection coordinates "losslessly",
i.e. lossless in the sense that comparisons between the
approximate intersections respect the comparison order of the
exact intersections.
(Fast and loose: If we have an N bit integer, it takes a fixed
point number format 1.2N just to be able to losslessly take the
inverse 1/x of x. Since when intersecting segments with N bit
coordinates, you end up computing the division of a 3N bit
dividend and a 2N bit divisor, we would need a 3N.4N fixed point
format to represent all possible line intersections. This is 7N
bits in total, but we're only interested in intersections with
the top 2N bits zero, so it's only 1N.4N for us.)
But that's for fixed point formats that have to take worst cases
into account. For floating point formats we only need about 3N
bits of mantissa to compute intersections with N bit coordinates.
(I've been saying "about mN bits" because we need an extra bit
here and there to account for overflow when doing things like
edge vector offsets and such.)
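To make the operand sizes concrete, here is a sketch of the
x-coordinate computation for a segment intersection; with N-bit inputs
the coordinate differences are N+1 bits, the denominator about 2N+2
bits and the dividend about 3N bits, so int64_t comfortably covers
16-bit inputs (the struct and names are invented for this example):

```c
#include <stdint.h>

/* A line segment with N-bit endpoint coordinates. */
typedef struct { int64_t x1, y1, x2, y2; } seg_t;

/* Computes the intersection x of the supporting lines of a and b as
 * the rational num/den.  Assumes the lines are not parallel
 * (den != 0). */
static void
intersect_x (seg_t a, seg_t b, int64_t *num, int64_t *den)
{
    int64_t dax = a.x2 - a.x1, day = a.y2 - a.y1;   /* N+1 bits   */
    int64_t dbx = b.x2 - b.x1, dby = b.y2 - b.y1;
    *den = dax * dby - day * dbx;                   /* ~2N+2 bits */
    /* parameter t = tn / den along segment a */
    int64_t tn = (b.x1 - a.x1) * dby - (b.y1 - a.y1) * dbx;
    /* x = a.x1 + t*dax = (a.x1*den + tn*dax) / den: ~3N-bit dividend */
    *num = a.x1 * *den + tn * dax;
}
```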
* Exact arithmetic -- not only for weenies!
Do we really care, and if so what can we do about it? Well, I
think it's not going to be a huge problem even if we completely
disregard the snapping issues. It's likely possible to do a
careful implementation of the deferred trapezoidization that
doesn't defer trap generation to the last minute, but only as far
as it is safe to without segment ordering issues.
Another possibility would be to look out for and fix orderings
gone wrong in the sweep line as we incrementally compute winding
numbers. While a straightforward implementation of that takes us
back into O(N^2) land again, it's likely not going to happen
often, so this is the fix I'm going to attempt first.
Having said that, the move to 24.8 fixed point does have the
potential to make the robustness issues more visible at the user
level. So we should keep in mind the possibility of computing
with exact arithmetic by keeping intersection coordinates in the
<integer> + <remainder>/<denominator> form.
The denominators and remainders will be ~64 bits, and comparisons
will be about as fast as now for the majority of points, where we
can differentiate by the integer portion only. It's only for
rare multiple intersections that we need to do any extra work at
all.
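A comparison on that representation might look like the following
sketch (the struct and names are made up; note that the
cross-multiplication needs a widened product):

```c
#include <stdint.h>

/* An exact x-coordinate kept as <integer> + <remainder>/<denominator>,
 * normalized so that 0 <= rem < den. */
typedef struct {
    int64_t  i;    /* integer (grid) part */
    uint64_t rem;  /* 0 <= rem < den */
    uint64_t den;  /* > 0 */
} exact_x_t;

/* Most comparisons are decided by the integer part alone; only ties
 * fall back to cross-multiplying the fractional parts, widened to 128
 * bits since a 64x64 product can overflow. */
static int
exact_x_compare (const exact_x_t *a, const exact_x_t *b)
{
    if (a->i != b->i)
        return a->i < b->i ? -1 : 1;
    unsigned __int128 lhs = (unsigned __int128) a->rem * b->den;
    unsigned __int128 rhs = (unsigned __int128) b->rem * a->den;
    if (lhs != rhs)
        return lhs < rhs ? -1 : 1;
    return 0;
}
```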
About 40-50% of the time spent in the wideint division code is
taken up, according to callgrind, by the relatively mundane
problem of dealing with correct signs of the results. Now, not
only is that kind of a silly thing to be wasting time on, it's
also doing the wrong thing, because the signed division is made
to simulate the common 2's complement symmetric division. This
leads to negative remainders sometimes, and is the wrong thing
when snapping to the grid (we could fix the snapping to account
for this, but it doesn't seem worth it). So Carl was thinking
that we could move to assuming non-negative inputs in the
tessellator by offsetting coordinates at the start.
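To illustrate why truncated division is the wrong thing for snapping
(assuming a positive divisor):

```c
#include <stdint.h>

/* C's signed division truncates toward zero, so -7/4 = -1 with
 * remainder -3, which snaps -7/4 the wrong way.  Grid snapping wants
 * floor division with a non-negative remainder; offsetting all input
 * coordinates to be non-negative up front makes the two agree. */
static int64_t
floor_div (int64_t a, int64_t b)   /* b > 0 assumed */
{
    int64_t q = a / b;
    if ((a % b) != 0 && a < 0)
        q--;                       /* round toward -infinity instead */
    return q;
}
```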
Another thing is that the skip list does seem like it's got a lot
of overhead for small lists. It would be useful to look into a
few different data structures here and see how they do.
Breaking through n log(n) in common cases: For simple geometries,
we're dominated algorithmically by the presorting of the
start/stop events. We should look into using a radix-like sort
based on the range of the inputs. Or even simply splitting the
y-range into equidistant buckets and qsorting those.
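The bucket-plus-qsort idea might look like this sketch on plain int
keys (the key type and bucket count are illustrative only):

```c
#include <stdlib.h>
#include <string.h>

static int
int_compare (const void *a, const void *b)
{
    int ia = *(const int *) a, ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

/* Distribute keys into equal-width buckets over [min, max], qsort each
 * bucket, and concatenate.  For inputs spread evenly over their range
 * the per-bucket sorts stay small, so the total cost approaches O(n). */
static void
bucket_sort (int *keys, size_t n, int min, int max, size_t nbuckets)
{
    size_t *counts = calloc (nbuckets + 1, sizeof (size_t));
    int *out = malloc (n * sizeof (int));
    double width = (double) (max - min + 1) / nbuckets;
    size_t i, b;

    /* count, prefix-sum, scatter: a counting pass per bucket */
    for (i = 0; i < n; i++)
        counts[(size_t) ((keys[i] - min) / width) + 1]++;
    for (b = 1; b <= nbuckets; b++)
        counts[b] += counts[b-1];
    for (i = 0; i < n; i++) {
        b = (size_t) ((keys[i] - min) / width);
        out[counts[b]++] = keys[i];
    }
    /* counts[b] now marks the end of bucket b; sort each bucket */
    for (b = 0; b < nbuckets; b++) {
        size_t start = b ? counts[b-1] : 0;
        qsort (out + start, counts[b] - start, sizeof (int), int_compare);
    }
    memcpy (keys, out, n * sizeof (int));
    free (out);
    free (counts);
}
```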
Other stuff to do is implementing Hobby's tolerance squares and
incorporating the faster 64 bit arithmetic, but those are, I
think, fairly low priority until we get the trapezoid generation
sorted out.