[cairo] [PATCH] add extents to clone_similar

Thu Oct 5 16:22:53 PDT 2006

On Thu, 05 Oct 2006 12:36:45 -0700, Carl Worth wrote:
> So everything's faster now, and the xlib-rgba cases are showing the
> constant performance scaling we'd like to see. The xlib-rgb cases
> improved but still have some badness in them. This suggests that the
> difference I pointed out above must be due to some additional
> full-surface copying badness in addition to the bug that this patch
> fixes.

I was chatting in #cairo about this problem with Owen Taylor, and he
made several observations about the measurement framework, (beginning
with some skepticism about the conclusion I came to above):

1) We were synchronizing with the X server in cairo_perf_timer_stop
   but not in cairo_perf_timer_start. This means that due to X
   lib/server buffering, any per-iteration setup code before
   cairo_perf_timer_start could be erroneously measured in the result.

   I've now pushed out a fix for this issue.

2) Perhaps the drawing operations we are measuring are too fast
   compared to the overhead of our measurement framework, (particularly
   with backends such as xlib where we have to synchronize with the
   external X server process).
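
As a rough illustration of observation 1, here is a minimal sketch in Python, not the actual cairo-perf C code. The `PerfTimer` class and its `sync` callback are hypothetical names. `sync` stands in for whatever round trip a backend needs to drain its request queue (for xlib, something like the 1x1 XGetImage mentioned further down). The point is only that the same synchronization happens before the clock is started and before it is read.

```python
import time


class PerfTimer:
    """Timer that flushes pending backend work at both ends of the timed region."""

    def __init__(self, sync=None):
        # `sync` is a hypothetical callback that blocks until the backend
        # (e.g. the X server) has processed everything queued so far.
        self._sync = sync or (lambda: None)
        self._start = None
        self.elapsed = 0.0

    def start(self):
        # Drain buffered setup requests so they are not billed to the test.
        self._sync()
        self._start = time.perf_counter()

    def stop(self):
        # Wait for the timed requests to finish before reading the clock.
        self._sync()
        self.elapsed = time.perf_counter() - self._start
        return self.elapsed
```

Without the `sync` call in `start()`, any setup work still sitting in a client-side buffer would be executed inside the timed region and counted against the test.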

The recent subimage_copy test really showcases this problem well since
it was designed to take effectively 0 time, (copying merely 16 pixels
from an image surface to the destination), so that if at larger
surface sizes it took any appreciable time at all that would
demonstrate the bug.

In order to better explore this issue, I expanded the subimage_copy
test into 6 variations, each of which runs 10 times as many iterations
as the previous one (in an inner loop, between timer_start() and
timer_stop()), with each iteration performing the same 16-pixel copy. Given
enough iterations, we should eventually overwhelm any overhead and
only be measuring the behavior of interest.
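
The structure of these variations can be sketched as follows. This is a Python stand-in under stated assumptions, not the cairo-perf source: `time_inner_loop` and `run_variations` are hypothetical helpers, and `op` represents the 16-pixel copy. The label format simply mimics the test names in the tables below.

```python
import time


def time_inner_loop(op, inner_iterations):
    """Return elapsed seconds for `inner_iterations` calls of `op` in one timed region."""
    start = time.perf_counter()
    for _ in range(inner_iterations):
        op()
    return time.perf_counter() - start


def run_variations(op, size=512, decades=6):
    """Time `op` at 1, 10, 100, ... inner iterations, keyed by a test-style label."""
    results = {}
    for k in range(decades):
        n = 10 ** k
        results[f"subimage_copy-{n}-{size}"] = time_inner_loop(op, n)
    return results
```

Because the fixed cost of starting and stopping the timer is paid once per variation, it shrinks relative to the total as the inner iteration count grows.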

Here are the results with some analysis:

[ # ]  backend-content                test-size   mean ms std dev. iterations
[ 30]    image-rgba        subimage_copy-1-512     0.010  0.25%   100
[ 31]    image-rgba       subimage_copy-10-512     0.018  0.11%   100
[ 32]    image-rgba      subimage_copy-100-512     0.105  0.09%   100
[ 33]    image-rgba     subimage_copy-1000-512     0.981  0.44%   100
[ 34]    image-rgba    subimage_copy-10000-512     9.838  2.25%   100
[ 35]    image-rgba   subimage_copy-100000-512    99.403  0.37%   100

Beginning with this 100000 iteration case, we assume that the 99 ms
time is accurate (that is, any measurement overhead is no longer
significant). We can then work backward and predict times of 9.9 ms
and 0.99 ms for the previous two runs, and when we check, they are
actually slightly better than the predictions. So that looks good.

But at 100 iterations and fewer we start getting results that exceed
our predictions, (by anywhere from about 6 to 9 microseconds). So we
conclude that for the image backend any timing measurement result that
isn't at least several 10s of microseconds is highly suspicious.
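
The backward prediction can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the numbers are the image-rgba means from the table above, and `excess_over_prediction_us` is a hypothetical helper name.

```python
# Mean times in ms for image-rgba, copied from the table above,
# keyed by the number of inner iterations.
image_rgba_ms = {
    1: 0.010,
    10: 0.018,
    100: 0.105,
    1000: 0.981,
    10000: 9.838,
    100000: 99.403,
}


def excess_over_prediction_us(means_ms):
    """Scale the largest run down linearly; return measured - predicted in microseconds."""
    n_max = max(means_ms)
    per_iteration_ms = means_ms[n_max] / n_max
    return {n: (t - per_iteration_ms * n) * 1000.0 for n, t in means_ms.items()}


for n, excess in sorted(excess_over_prediction_us(image_rgba_ms).items()):
    print(f"{n:>6} iterations: {excess:+8.1f} us")
```

A positive value means the measurement exceeds the prediction. With these inputs the 1, 10 and 100 iteration cases come out positive by single-digit microseconds, while the 1000 and 10000 cases come out negative.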

[ 30]    image-rgb         subimage_copy-1-512     0.010  0.62%   100
[ 31]    image-rgb        subimage_copy-10-512     0.019  0.30%   100
[ 32]    image-rgb       subimage_copy-100-512     0.106  0.11%   100
[ 33]    image-rgb      subimage_copy-1000-512     0.985  0.47%   100
[ 34]    image-rgb     subimage_copy-10000-512     9.862  2.14%   100
[ 35]    image-rgb    subimage_copy-100000-512    99.851  0.40%   100

Here, with the rgb image surface destination, the results are almost
exactly as above.

[ 30]     xlib-rgba        subimage_copy-1-512     0.071  0.35%   100
[ 31]     xlib-rgba       subimage_copy-10-512     0.080  0.36%   100
[ 32]     xlib-rgba      subimage_copy-100-512     0.168  0.16%   100
[ 33]     xlib-rgba     subimage_copy-1000-512     1.043  0.82%   100
[ 34]     xlib-rgba    subimage_copy-10000-512     9.904  2.63%   100
[ 35]     xlib-rgba   subimage_copy-100000-512    99.409  0.52%   100

And here with the xlib-rgba (destination Pixmap) backend, we also see
a similar pattern. But here the measurements begin to exceed the
predictions already at 1000 iterations, (with errors ranging from about 50
to 70 microseconds). The fact that the error is larger than in the
image case is consistent with the fact that the xlib test framework
has more synchronization overhead, (from doing a 1x1 XGetImage at the
beginning and end of each timing loop). So we conclude that for this
backend we need test results to be in the hundreds of microseconds to
be reliable.
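
One way to turn an observed fixed overhead into a "minimum trustworthy measurement" is to pick an acceptable relative error and divide. The sketch below assumes a 10% tolerance, which is an illustrative choice and not a figure from this message. `min_reliable_duration_us` is a hypothetical helper.

```python
def min_reliable_duration_us(overhead_us, tolerance=0.10):
    """Shortest measured time for which a fixed overhead stays within `tolerance`.

    If a measurement of duration T includes a fixed overhead, the relative
    error is overhead / T, so T must be at least overhead / tolerance.
    """
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    return overhead_us / tolerance


# Overheads taken from the upper end of the ranges quoted in the text.
image_floor_us = min_reliable_duration_us(9.0)   # image backend
xlib_floor_us = min_reliable_duration_us(70.0)   # xlib-rgba backend
```

With that assumed tolerance, the image backend floor lands in the tens of microseconds and the xlib-rgba floor in the hundreds, in line with the conclusions above.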

[ 30]     xlib-rgb         subimage_copy-1-512     9.518  2.06%   100
[ 31]     xlib-rgb        subimage_copy-10-512     9.555  2.47%   100
[ 32]     xlib-rgb       subimage_copy-100-512     9.624  2.03%   100
[ 33]     xlib-rgb      subimage_copy-1000-512    10.566  2.58%   100
[ 34]     xlib-rgb     subimage_copy-10000-512    19.594  1.92%   100
[ 35]     xlib-rgb    subimage_copy-100000-512   110.068  0.38%   100

Wow, look at that. There is some large overhead of about 9.5
milliseconds here. If we subtract that off from the last three cases
then we do get times that are consistent with the previous tests:

iterations	corrected mean (ms)
      1000	  1.066
     10000	 10.094
    100000	100.568
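
The corrected means above are what you get by subtracting a fixed 9.5 ms (roughly the 1-iteration time) from the measured xlib-rgb values. A short Python check of that arithmetic, with the constant labeled as an assumption inferred from the table:

```python
# Measured xlib-rgb means in ms, keyed by inner iteration count (from the table above).
xlib_rgb_ms = {
    1: 9.518,
    10: 9.555,
    100: 9.624,
    1000: 10.566,
    10000: 19.594,
    100000: 110.068,
}

# Assumption: the fixed overhead implied by the corrected table.
FIXED_OVERHEAD_MS = 9.5

corrected_ms = {
    n: round(t - FIXED_OVERHEAD_MS, 3)
    for n, t in xlib_rgb_ms.items()
    if n >= 1000
}

for n, t in sorted(corrected_ms.items()):
    print(f"{n:>10}\t{t:8.3f}")
```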

So then I checked to see whether this overhead is size-dependent or
not:

[  0]     xlib-rgb         subimage_copy-1-16      0.106  0.56%   100
[  6]     xlib-rgb         subimage_copy-1-32      0.135  0.73%   100
[ 12]     xlib-rgb         subimage_copy-1-64      0.252  0.95%   100
[ 18]     xlib-rgb         subimage_copy-1-128     0.651  3.80%   100
[ 24]     xlib-rgb         subimage_copy-1-256     2.359  1.80%   100
[ 30]     xlib-rgb         subimage_copy-1-512     9.518  2.06%   100

So there does appear to be some correlation with the number of pixels
there.
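
To make the correlation a little more concrete, the sketch below computes the growth factor between successive sizes and the time per pixel. It assumes that a size of N in the test name means an N x N destination surface, which is an assumption about the test setup and is not stated in the table. The helper names are hypothetical.

```python
# 1-iteration xlib-rgb means in ms, keyed by surface size (from the table above).
xlib_rgb_1_iter_ms = {
    16: 0.106,
    32: 0.135,
    64: 0.252,
    128: 0.651,
    256: 2.359,
    512: 9.518,
}


def ns_per_pixel(size, mean_ms):
    """Mean time divided by pixel count, assuming a square size x size surface."""
    return mean_ms * 1e6 / (size * size)


def growth_ratios(means_ms):
    """Ratio of each mean to the mean at the next smaller size."""
    sizes = sorted(means_ms)
    return {b: means_ms[b] / means_ms[a] for a, b in zip(sizes, sizes[1:])}


ratios = growth_ratios(xlib_rgb_1_iter_ms)
for size in sorted(xlib_rgb_1_iter_ms):
    mean = xlib_rgb_1_iter_ms[size]
    print(f"{size:>4}: {ns_per_pixel(size, mean):8.1f} ns/pixel, "
          f"x{ratios.get(size, float('nan')):.2f} vs previous size")
```

If the overhead were purely proportional to pixel count, each doubling of the size would multiply the time by 4. The ratio approaches 4 at the largest sizes, where the per-pixel figure also levels off, while at small sizes some size-independent cost appears to dominate.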

In trying to figure this out, there is another difference between what
happens with the xlib-rgba case and the xlib-rgb case. In the first,
we are using XCopyArea for the actual pixel copying, but in the second
we are using XRenderComposite instead.

This overhead we are seeing depends on the size of the destination
surface. I don't think it depends on the size of the source surface,
since from everything I can see so far, we're not actually passing
that size in any X code within the timing loop. And, significantly,
the overhead is fixed, regardless of how many iterations of copying we
do.

Could it be that there is some size-dependent overhead in the X server
that is triggered by the first Render operation to draw to the
Picture?

I guess one thing I could do to narrow this down is to keep the
destination surface size constant and only increase the size of the
source surface each time.

I don't really like the idea that xlib performance testing is
unreliable unless every iteration spends several 10s of milliseconds
drawing. I don't want to slow down our test suite that much, (though
perhaps I can compensate somewhat by reducing the number of iterations
in the outer loop without negatively impacting the standard deviation
of the results).

But I really would like to figure out what's happening here since we
may end up with some tests that really do require very large
destination surfaces, and it would be unfortunate to have large
measurement errors in these due to whatever effect is tripping us up
here.

-Carl

PS. There's an odd pattern in the standard deviation in the first
three chunks of results I quoted above. Why should the case with 10000
iterations always have a standard deviation well over 2% when in all
other cases it is well under 1%? Maybe there's something wrong there
as well.