[Mesa-dev] r600g/mesa/gallium performance, who is to blame?

Jerome Glisse j.glisse at gmail.com
Fri Nov 12 11:55:08 PST 2010


Hi,

I have been doing some benchmarking lately to try to identify the
bottleneck in the Mesa/Gallium/R600 driver. I fear the results are not
the ones I expected. I would have liked the GPU to be the bottleneck,
so that adding new features such as texture tiling or hyper-z would
immediately boost performance. I mostly used "old" GL applications
with quake 3 style rendering (openarena, cake, quake3, ut2004 and
nexuiz are among the programs I used); I am pretty sure most of them
can eventually reach close to maximum performance without advanced GPU
optimizations such as hyper-z, because newer hardware such as
R600/R700/Evergreen should have enough raw power for that kind of
rendering.

I have used sysprof (you will find the sysprof xml files in the archive
mentioned below) to collect timing information on where the CPU is
spending its time. To minimize the GPU impact and avoid having to compare
with pageflipping, games were run in windowed mode at 800x600. I would
also like to point out that, given the nature of GL and the layering
of our implementation of it, one should be prudent with sysprof
results. If the top of the stack is inefficient and calls several
times into the hw driver, the hw driver functions might show up as the
top offenders, but this doesn't necessarily mean that they are the ones
to blame (though they share responsibilities ;)).

(b billion, m million, t thousand) (r600g/r500g gallium driver,
r600c/r500c classic driver, r600f fglrx; ng no gpu, i.e. no command
submitted to the gpu; np no pipe, i.e. pipe driver calls turn into
no-ops; nb no kernel buffer (bo) management, i.e. replaced by malloc;
nv47b nv47 with the nvidia blob; nv47g nv47 with the nouveau gallium
driver)

I used the same configuration throughout (an HD3650, GTS7800 or X1950
radeon, with mesa from nov 11). You can download most of the
sysprof/results here (along with a ddx patch to remove throttling and a
mesa patch showing how I commented out the pipe driver part):



* vertexrate --------------------------------------------------------

verts/second      immediate    glDrawArrays  VBO_glDrawArrays
r600g ngnpnb          16.2m        11700.0m          16600.0m
r600g ngnp            33.5m        11700.0m          17400.0m
r600g ng              29.2m           58.3m           2200.0m
r600f                 33.5m           95.1m            240.4m
r600g                 29.3m          117.3m            203.8m
r600c                 17.2m           42.2m            121.9m
r500g                 17.5m           31.0m            141.5m
r500c                 14.4m           13.5m             12.5m
r500c ng              20.2m           57.5m           1700.0m
nv47g                 14.5m           55.5m            118.7m
nv47b                 65.8m           68.7m            201.2m

demos/perf/vertexrate shows that the r600g driver is close to fglrx
(sometimes a little bit faster, sometimes a little bit slower). Other
tests I have done make me confident that the vertex path is not the
bottleneck in our pipeline (though it could still probably be
optimized further). It also seems that we are suffering from call
overhead (likely TLS or other similar optimizations in our GL
dispatch code); nvidia is a lot better at facing millions of calls.

It also seems quite clear that the gallium driver outperforms the
classic driver in the DrawArrays case. So from a pure vertex rate
point of view gallium is better.
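
For reference, the three columns correspond roughly to the following
submission paths (only a sketch assuming a current GL context and
already prepared vertex data, not the actual demos/perf/vertexrate
code):

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

/* Sketch of the three paths vertexrate compares; the real demo differs in
 * detail. Assumes `verts` points at n XYZ vertices and `vbo` is a buffer
 * object whose data has already been uploaded. */
static void draw_immediate(const float *verts, int n)
{
    int i;
    glBegin(GL_TRIANGLES);
    for (i = 0; i < n; i++)
        glVertex3fv(verts + 3 * i);         /* one API call per vertex */
    glEnd();
}

static void draw_arrays(const float *verts, int n)
{
    glVertexPointer(3, GL_FLOAT, 0, verts); /* user-space array */
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_TRIANGLES, 0, n);
}

static void draw_vbo_arrays(GLuint vbo, int n)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);     /* data lives in a GPU buffer */
    glVertexPointer(3, GL_FLOAT, 0, NULL);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_TRIANGLES, 0, n);
}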


* swapbuffer --------------------------------------------------------

in pixels/second 320x240(s) 320x240(scd) 1200x1024(s) 1200x1024(scd)
r600g ngnpnb          797m         647m        1100m           730m
r600g ngnp            795m         651m        1100m           729m
r600g ng              812m         470m        1100m           729m
r600f                1300m         678m        1500m           980m
r600g                 730m         324m        1100m           365m
r600c                 654m         269m         728m           363m
r500g                1100m         535m        2800m          1500m
r500c                 954m         310m        1800m          1400m
r500c ng             1100m         367m        1800m          1800m
nv47g                1100m          30m        5200m           394m
nv47b                2300m        2000m        3700m          2600m
(s swap, scd swap/clear/draw)

demos/perf/swapbuffers pretty much shows that even if dri2 is less
efficient at performing buffer copy/swap than fglrx, it's still not the
thing that is slowing us down.

* drawoverhead ------------------------------------------------------

draw call/second     draw only     draw nop sc     draw sc
r600g ngnpnb             1600t           1500t        173t
r600g ngnp               1700t           1600t         69t
r600g ng                  220t            220t         46t
r600f                    4100t           3500t       1300t
r600g                     123t            122t         34t
r600c                      73t             71t         60t
r500g                     235t            236t         80t
r500c                     115t            115t         97t
r500c ng                  171t            171t        135t
nv47g                     496t            471t        121t
nv47b                   10600t           9300t       1200t
(nop sc: no-op state change, i.e. the state is changed but to the same
value; sc: the state is changed to a different value between each draw call)


demos/perf/drawoverhead has the most interesting numbers: r600g goes
from 123t calls/sec when there is no state change between draw calls to
34t calls/sec when there is a state change between draw calls (the fact
that we only draw 4 vertices at each call is not that important; if we
trust vertexrate we are not impacted by the number of vertices we draw).
So a state change divides the raw performance of our stack by 3.6. I
wanted to know who was to blame for this.
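
To make it concrete, the pattern the benchmark stresses is roughly the
following (only an illustration; the real demos/perf/drawoverhead
toggles its own piece of state):

#include <GL/gl.h>

/* Illustrative inner loop: a tiny 4 vertex draw with a piece of state set
 * to a *different* value before every call, so nothing can be filtered out
 * as redundant. Assumes a current GL context and a bound vertex array. */
static void draw_with_state_change(int iters)
{
    int i;
    for (i = 0; i < iters; i++) {
        glDepthFunc((i & 1) ? GL_LEQUAL : GL_LESS); /* the "sc" case */
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);      /* 4 vertices per call */
    }
}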

In order to find out which part of the stack is underperforming in the
face of state changes, I slowly disabled layers starting from the bottom
(which is the only way to do this ;o)). First I disabled the command
buffer submission to the GPU (r600g-nogpu) while making sure the driver
still believed things were happening. The drawoverhead numbers go from
123t calls/sec (r600g) to 220t calls/sec (r600g-nogpu). So the GPU is
slowing things down a bit but not that much; comparing sysprof profiles
also shows that we are spending a lot of time in the cs ioctl.
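
Roughly speaking, the "nogpu" hack amounts to something like the
following in the winsys (the structure and field names below are
made-up stand-ins, not the real r600g ones):

#include <stdint.h>

/* Hypothetical "nogpu" flush: keep the command-stream bookkeeping so the
 * driver believes submission happened, but never enter the CS ioctl. */
struct fake_cs {
    uint32_t *buf;   /* command buffer being built */
    unsigned  cdw;   /* number of dwords written so far */
};

static void nogpu_cs_flush(struct fake_cs *cs)
{
    /* the real path would call into the kernel here, e.g. through
     * drmCommandWriteRead(fd, DRM_RADEON_CS, ...) */
    cs->cdw = 0;     /* pretend the kernel consumed the whole stream */
}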

Next was to disable the r600g pipe driver, basically turning the
driver into a no-op where each call into it is ignored except for
buffer/resource/texture allocations. The drawoverhead numbers go from
220t calls/sec (r600g-nogpu) to 1700t calls/sec (r600g-nogpu-nopipe).
Obviously the r600g pipe driver is a CPU intensive piece of code, with a
lot of register marshalling. But the most disturbing fact is that, even
in that configuration, we achieve 24.6 times fewer draw calls per second
when there is a state change (69t) than when there is none (1700t),
which points out that the pipe driver is likely not the only one to blame.
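
The "nopipe" hack stubs out the pipe driver entry points, along these
lines (names are stand-ins; the actual change was done by commenting out
the bodies of the r600g callbacks):

/* Sketch of "nopipe": every pipe_context hook becomes an empty function,
 * so what is left to measure is Mesa core + state tracker cost only.
 * Opaque stand-in types are used instead of the real Gallium headers. */
struct pipe_context;
struct pipe_draw_info;

static void nopipe_draw_vbo(struct pipe_context *ctx,
                            const struct pipe_draw_info *info)
{
    (void)ctx; (void)info;   /* no register marshalling, no CS building */
}

static void nopipe_bind_blend_state(struct pipe_context *ctx, void *state)
{
    (void)ctx; (void)state;
}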

Last was to see if our memory allocation through gem/ttm was hurting
us. Yes it does: drawoverhead with no state change gives 1600t calls/sec
(r600g-nogpu-nopipe-nobo) and 173t calls/sec with a state change
(r600g-nogpu-nopipe-nobo). So when we use malloc for buffer allocation
the performance drop between no state change and a state change is only
a factor of 9.4 (instead of 24.6). So obviously GPU buffer allocation
is costing us a lot.
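
And the "nobo" hack replaces gem/ttm buffer objects with plain malloc,
roughly like this (names are made up; it is only good for measuring
overhead, nothing actually renders that way):

#include <stdlib.h>

/* Hypothetical malloc-backed buffer: the allocation path never enters the
 * kernel, which is what the "nb" numbers measure. */
struct fake_bo {
    void    *cpu_ptr;
    unsigned size;
};

static struct fake_bo *fake_bo_create(unsigned size)
{
    struct fake_bo *bo = malloc(sizeof(*bo));
    if (!bo)
        return NULL;
    bo->cpu_ptr = malloc(size);    /* instead of a GEM create ioctl */
    bo->size = size;
    return bo;
}

static void fake_bo_destroy(struct fake_bo *bo)
{
    free(bo->cpu_ptr);
    free(bo);
}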

sysprof shows that it's on shader constants that we are losing most
of our bo CPU time. In that path it's the pb_bufmgr_cache mechanism
that is to blame; we are losing a lot of time in gettimeofday. We are
also losing a lot of time in the pb_bufmgr_cache bo allocation path
(again in gettimeofday). These show up too in the nexuiz sysprof, but
they look less offensive there because the nexuiz usage pattern also
suffers from bottlenecks in other parts of the code.
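
The pattern being flagged looks roughly like this (simplified; the real
pb_bufmgr_cache code is more involved, but the point is that a time
query ends up on the per-allocation hot path):

#include <stdint.h>
#include <sys/time.h>

/* Simplified illustration of why gettimeofday() shows up in sysprof: the
 * cached buffer manager timestamps buffers on release and checks them for
 * expiry on every allocation, so the time query runs at least once per
 * draw that needs a fresh constant buffer. */
static int64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);               /* syscall on the hot path */
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

static int bo_is_expired(int64_t release_time_us, int64_t timeout_us)
{
    return now_us() - release_time_us > timeout_us;
}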

The drawoverhead outcome is, I believe, that gallium is severely
under-performing in the face of state changes. If I had to guess I would
say that an improvement by a factor of n here would give an improvement
of ~n overall (at least for r600g, and likely for r300g too).
Is anyone working on this? Or does anyone know what could be done to
improve it? I didn't spot any obvious mistake in the mesa state tracker.
Of course one could argue that it's the pipe driver which is slow, but I
don't think it's the only one to blame. The classic drivers don't fall
over in the drawoverhead test, though they are a lot less performant on
this benchmark, so maybe the classic bottleneck is also somewhere in the
state world.

To me gallium needs to be improved to be more efficient at changing
only the smallest number of states and to avoid calls into the pipe
driver. I am not saying there is nothing to be done in the pipe
driver, but there is a limit to what we can do there, and a gpu driver
is all about avoiding doing things :o)
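
What I mean by avoiding calls into the pipe driver is, very roughly,
filtering like the following at the state tracker level (types and
fields are made up; the real cso/state tracker code is obviously more
involved):

#include <string.h>

/* Illustration of state filtering: only call into the pipe driver when the
 * bound state actually differs from what is already current. */
struct blend_state { unsigned rgb_func, alpha_func, blend_enable; };

struct tracker_ctx {
    struct blend_state current_blend;
    void (*bind_blend)(void *pipe, const struct blend_state *bs);
    void *pipe;
};

static void set_blend(struct tracker_ctx *st, const struct blend_state *bs)
{
    if (memcmp(bs, &st->current_blend, sizeof(*bs)) == 0)
        return;                            /* redundant change, no pipe call */
    st->current_blend = *bs;
    st->bind_blend(st->pipe, bs);          /* only when something differs */
}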

The nvidia nouveau driver also shows the same kind of issue on
drawoverhead. I would also like to stress that I am not interested in
making drawoverhead superfast, but I believe drawoverhead exhibits
one of the biggest shortcomings of today's gallium implementations.

This made me wonder if it would not be a good time to start thinking
about doing a new pure GL2 state tracker for gallium (an idea I stole
from Stephane ;o)).
Drawbacks :
	- losing/missing the fixes/GL spec interpretations that have grown
into mesa over all the years
	- Intel is still on classic, and GL is big.
Advantages :
	- implemented with efficiency in mind straight from the beginning
	- less code than mesa ?
	- faster gallium.

To mitigate the drawbacks one might start with EGL2 only, which should
be a lot simpler to achieve while still allowing to test some real apps
(like the quake3 or doomIII rendering engines).

* bo allocation -----------------------------------------------------

Another constant pattern I see in all the benchmarks I have done is that
a lot of bo allocation/destruction happens during the draw sequence
(mostly for shaders & shader constants, but also for temporary vertex
buffers). The current solutions we have are under-performing (see the
pb_bufmgr* bottleneck I am talking about above).

Another point is how we handle all those bos; it seems that our current
command submission mechanism is severely slow. If you look at sysprof
you will see that despite being 3 times slower we are spending the
same amount of time in kernel space as the closed source driver, which
surely can't be good.

To address this I want to play a bit with ttm. For instance, try
considering every bo as pinned and use a big gtt space (simplifying away
bo validation/allocation so it becomes lock-less or at least very quick).
This would help to determine the cost (if any) of our current scheme
where we always try to satisfy the userspace request, which is often to
have the bo in vram. I might also try to change the bo mapping by always
using pages and adding a call to sync the vram copy (such changes
obviously need hacks in userspace too, otherwise they won't work
properly).

Also, I think we have been a bit naive to believe that one can build a
GL stack first and make it fast afterward (at least I have been
naive :o)). An efficient/performant GL stack can only be done by
carefully evaluating each step of the way and changing what needs to
be changed, no matter where in the stack. Which would have meant for us
having an unstable kernel API until we are at a point where we see GL
performing reasonably well (I envy the nouveau people, who are wiser on
this front :o)).

Sorry for the long mail, but I wanted to explain the reasoning behind
my findings. Maybe I am completely wrong and overlooked something; I
hope not.

Cheers,
Jerome Glisse
