[Intel-gfx] X11 performance regressions

Adam Jackson ajax at redhat.com
Wed May 11 21:49:34 CEST 2011


On Wed, 2011-05-11 at 16:46 +0200, Knut Petersen wrote:
> Yes, I made some mistakes during my first measurements.
> 
> Below find better results. They are made on the same machine,
> with the same kernel, at the same speed, with the same x11perf
> program, absolutely nothing changed.

You don't mention whether the 2d driver varies; I assume it does at
least to the extent of rebuilding for new ABI.  Or libdrm, although
that's really a 1% kind of thing.

> I think the numbers below are quite interesting ...

I still wager they're more about the environment than about the driver
proper; there are just too many weird things going on in your results.
For example:

>   198000.0    0.27   ShmPutImage 10x10 square
>     1570.0    0.46   ShmPutImage 500x500 square
>    21700.0    0.61   ShmPutImage 100x100 square

This is essentially a memcpy benchmark.  Something has to be very wrong
for that much variation to happen, and my guess would be something like
failing to inline memcpy or pick sufficiently macho optimized versions.
I'd be interested to see what your CFLAGS from build.sh ended up being,
relative to what opensuse gives for 'rpm --eval "%{optflags}"'.
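Once you have both flag strings, a trivial way to spot what differs is a
set diff. A hypothetical helper (not part of any build tooling; the rpm
invocation in the comment is just the one mentioned above):

```python
def flag_diff(ours, theirs):
    """Return (flags only we have, flags only they have)."""
    # e.g. theirs = subprocess.check_output(
    #          ["rpm", "--eval", "%{optflags}"], text=True)
    a, b = set(ours.split()), set(theirs.split())
    return sorted(a - b), sorted(b - a)

print(flag_diff("-O0 -g", "-O2 -g -fomit-frame-pointer"))
```

If `-O2` (or the arch-specific `-march`/`-mtune` flags) shows up only on
the distro side, that alone would explain a slow memcpy path.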

One cool thing you can do from memcpy benchmarks like this is
extrapolate a bandwidth number. Your fast numbers are (small tests to
big) 75.5, 828, and 1497 MB/s. Normally one expects some growth in those
numbers for bigger tests, but typically the jump from 10x10 to 100x100
is a bit larger than the jump from 100x100 to 500x500.
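For reference, the arithmetic behind those figures is just rate times
bytes moved per rep. A quick sanity check, assuming 32bpp (4 bytes per
pixel) and 2^20 bytes per MB, which is what makes the numbers line up:

```python
# Reproduce the bandwidth figures from the ShmPutImage rates above.
# Assumes 4 bytes/pixel; "MB" here means 2**20 bytes.

def bandwidth_mb_s(reps_per_sec, width, height, bytes_per_pixel=4):
    return reps_per_sec * width * height * bytes_per_pixel / 2**20

print(round(bandwidth_mb_s(198000, 10, 10), 1))    # 10x10   -> 75.5
print(round(bandwidth_mb_s(21700, 100, 100), 1))   # 100x100 -> 827.8
print(round(bandwidth_mb_s(1570, 500, 500), 1))    # 500x500 -> 1497.3
```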

So that hints that small-work tests are being choked somehow. Recall
that x11perf does a 1-pixel GetImage periodically in order to guarantee
that results actually hit the framebuffer instead of just being queued
in the command stream, so round-trip performance with the X server does
actually matter. More than that, small-work requests (which take less
time) would be more strongly dominated by round-trip speed than
large-work requests. Given that:

>    15400.0    0.54   GetProperty
>    15500.0    0.54   QueryPointer

is very telling. Those requests do essentially no work, but they are
round-trips, and their throughput is thus bounded mostly by how long it
takes the scheduler to ping-pong between x11perf and the server. A
factor of ~2 drop would lead me to suspect something like one kernel
scheduling the processes on different cores, and the other both on the
same core; two processes splitting one CPU's time with maybe a little
cache warmth between them would intuitively be about half as fast as two
processes each with their own CPU.
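You can watch that ping-pong cost without X in the picture at all: a
1-byte round trip over a pair of pipes between a parent and a forked
child is the same scheduler pattern. A rough Linux-only sketch (an
analogy, not x11perf itself):

```python
import os
import time

def pingpong_round_trip(n=20000):
    """Average round-trip time for a 1-byte ping-pong between two processes."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo each byte straight back, then exit.
        for _ in range(n):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)
    start = time.perf_counter()
    for _ in range(n):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / n

# To emulate "taskset -c 0" for both ends, pin before forking
# (affinity is inherited across fork):
#   os.sched_setaffinity(0, {0})
print("round trip: %.1f us" % (pingpong_round_trip() * 1e6))
```

Run it pinned to one CPU and unpinned; if the scheduler theory holds,
the pinned case should come out roughly 2x slower, matching the
GetProperty/QueryPointer ratio.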

Empirical evidence: On the Ironlake laptop on my desk (kernel
2.6.38.3-18.fc15), if I use taskset to bind the X server to CPU0,
running "x11perf -prop -pointer" bound to CPU0 gives:

 300000 trep @   0.0322 msec ( 31100.0/sec): QueryPointer
 300000 trep @   0.0321 msec ( 31200.0/sec): GetProperty

x11perf bound to CPU3 gives:

 600000 trep @   0.0193 msec ( 51900.0/sec): QueryPointer
 600000 trep @   0.0192 msec ( 52200.0/sec): GetProperty

And running it unbound (letting the scheduler decide) gives:

 600000 trep @   0.0198 msec ( 50600.0/sec): QueryPointer
 600000 trep @   0.0208 msec ( 48000.0/sec): GetProperty

I'd be curious to see how you fare with experimenting with taskset.

One set of results that's a little confusing, and thus probably in the
end most enlightening:

>   553000.0    0.24   Copy 10x10 from pixmap to pixmap
>   140000.0    0.86   Copy 10x10 from window to pixmap
>   143000.0    0.88   Copy 10x10 from pixmap to window
>      867.0    0.99   Copy 500x500 from pixmap to pixmap
>      870.0    1.00   Copy 500x500 from window to window
>    19800.0    1.01   Copy 100x100 from pixmap to pixmap
>    19900.0    1.01   Copy 100x100 from pixmap to window
>    20000.0    1.01   Copy 100x100 from window to pixmap
>    19600.0    1.01   Copy 100x100 from window to window
>      851.0    1.01   Copy 500x500 from pixmap to window
>      849.0    1.02   Copy 500x500 from window to pixmap
>    81700.0    1.52   Copy 10x10 from window to window

This _mostly_ makes sense. These are all just varying calls to
XCopyArea, which does not have a reply. The medium and large ops are
approximately identical before and after. The 0.8x results make sense in
the context of scheduling funniness for small-work requests. But the two
outliers are perplexing. I would guess that copywinwin10 got faster due
to some optimization surrounding buffer reuse or flush reduction (you're
always working on the same buffer, so you can do less work), and that
copypixpix10 is operating wholly in host memory for some reason and
therefore hitting the same kind of memcpy issue as in your ShmPutImage
results.

I'll also note that the paths where you're losing hardest are, in the
majority, things that the driver makes no attempt to accelerate
(anything with the word "tiled" or "stippled" involved, for example). I
would tend to chalk that up to something like gcc -O0 before anything
else since you're primarily measuring the efficiency of the software
renderer. I'm actually pretty pleased with the results you've shown:
10% or better speedup for basically all text ops, about half of the
window management ops, and almost all window exposure ops.

- ajax