[Mesa-dev] r600g/mesa/gallium performance, whois to blame ?

Fri Nov 12 19:50:59 PST 2010

Jerome Glisse <j.glisse at gmail.com> writes:

> Hi,
>
>[...]
> optimized further). Also it seems that we are suffering from call
> overhead (likely TLS or other similar optimizations in our GL
> dispatching stuff); nvidia is a lot better at facing millions of calls.
>
Yeah, the nvidia binary blob is really good at immediate mode, but
that's pretty useless for real world GL games.

>[...]
> In order to find out which part of the stack is underperforming in
> front of state changes I slowly disabled layers starting from the
> bottom (which is the only way to do this ;o)). Thus I disabled the
> command buffer submission to the GPU (r600g-nogpu) and made sure the
> driver still believed things were happening. Drawoverhead state change
> went from 123t(call/sec-r600g) to 220t(call/sec-r600g-nogpu). So the
> GPU is slowing things down a bit but not that much; also, comparing
> sysprof output shows that we are spending a lot of time in the cs ioctl.
>
In nouveau we also had a little performance problem with our pushbuf
ioctl. Larger command buffers helped a lot because that allowed
userspace to pile up a considerable amount of rendering before coming
back to kernel mode. (This fix might be completely irrelevant to your
case though: apparently the radeon CS IOCTL is O(n) in the number of
dwords submitted while its nouveau counterpart is O(1).)
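
To illustrate why batching helps only with the fixed part of the
submission cost, here is a toy cost model. It is a sketch, not a
measurement: all three constants are made-up placeholders, and the
function name is hypothetical.

```python
# Toy cost model; every constant below is a made-up placeholder, not a measurement.
FIXED_SUBMIT_COST_US = 20.0   # hypothetical fixed cost of one trip into the kernel
PER_DWORD_COST_US = 0.01      # hypothetical per-dword cost (the O(n) part)
DWORDS_PER_DRAW = 50          # hypothetical command stream size of one draw


def submit_cost_per_draw(draws_per_submit: int) -> float:
    """Kernel submission cost attributed to each draw, in microseconds."""
    dwords = draws_per_submit * DWORDS_PER_DRAW
    total = FIXED_SUBMIT_COST_US + PER_DWORD_COST_US * dwords
    return total / draws_per_submit


for n in (1, 16, 256, 4096):
    print(f"{n:5d} draws/submit -> {submit_cost_per_draw(n):.3f} us/draw")
```

In this model the fixed cost shrinks as more draws share one
submission, but the per-dword term sets a floor that no amount of
batching removes, which is why the fix may matter less for an O(n)
ioctl.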

> Next was to disable the r600g pipe driver, basically turning the
> driver into a no-op where each call into it is ignored except for
> buffer/resource/texture allocations. Drawoverhead state change went
> from 220t(call/sec-r600g-nogpu) to 1700t(call/sec-r600g-nogpu-nopipe).
> Obviously the r600g pipe is a CPU intensive task, with lots of
> register marshalling. But the most disturbing fact is that we achieve
> 24.6 times fewer draw calls per second when there is a state change
> than when there is none, pointing out that the pipe driver is likely
> not the only one to blame.
>
Relative terms are somewhat misleading here. These are the absolute
overheads calculated from your results:

  r600g ngnpnb            0.005155 ms/draw
  r600g ngnp              0.013905 ms/draw
  r600g ng                0.017194 ms/draw
  r600g                   0.021282 ms/draw
  nv47g                   0.006248 ms/draw

So, yes, the pipe driver is definitely not the only one to be blamed,
but at least 75% of the total overhead comes from below the mesa state
tracker.
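
The 75% figure can be checked directly from the table. The sketch
below assumes that "ng", "np" and "nb" abbreviate the nogpu, nopipe
and nobo configurations from the quoted mail, and that the difference
between two adjacent rows can be attributed to the layer that was
disabled between them; the layer names in the second dictionary are
my labels for that attribution, not something the table states.

```python
# State-change overhead per draw (ms), copied from the table above.
overhead_ms = {
    "r600g ngnpnb": 0.005155,
    "r600g ngnp": 0.013905,
    "r600g ng": 0.017194,
    "r600g": 0.021282,
}

total = overhead_ms["r600g"]
above = overhead_ms["r600g ngnpnb"]   # what is left with GPU, pipe driver and BOs disabled
below = total - above

print(f"below the state tracker: {below:.6f} ms/draw ({100 * below / total:.1f}%)")

# Assumed attribution: each difference is charged to the layer disabled between rows.
layers = {
    "buffer allocation (GEM/TTM)": overhead_ms["r600g ngnp"] - overhead_ms["r600g ngnpnb"],
    "pipe driver": overhead_ms["r600g ng"] - overhead_ms["r600g ngnp"],
    "command submission and GPU": overhead_ms["r600g"] - overhead_ms["r600g ng"],
}
for name, ms in layers.items():
    print(f"{name}: {ms:.6f} ms/draw ({100 * ms / total:.1f}%)")
```

With these numbers the share below the state tracker comes out at
roughly 76%, and under the assumed attribution buffer allocation is
the largest single contributor.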

> Last was to see if our memory allocation through gem/ttm was hurting
> us. Yes it does (drawoverhead no state change
> 1600t(call/sec-r600g-nogpu-nopipe-nobo), drawoverhead state change
> 173t(call/sec-r600g-nogpu-nopipe-nobo)). So when we use malloc for
> buffer allocation the performance, between no state change and a
> state change, drops only by a factor of 9.4. So obviously GPU buffer
> allocation is costing us a lot.
>
The question is why GEM/TTM is costing you anything at all in this
particular case. Given working and smart enough buffer suballocation or
caching, "drawoverhead" would never have met TTM inside its main loop.
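
As a rough illustration of what I mean by caching, here is a minimal
sketch of a size-bucketed buffer cache. It is not how any real winsys
does it: BufferCache and its methods are hypothetical names, bytearray
stands in for a kernel buffer object, and a real driver would also
have to make sure the GPU is done with a buffer before reusing it.

```python
class BufferCache:
    """Size-bucketed cache of released buffers (illustrative only)."""

    def __init__(self, kernel_alloc):
        self._kernel_alloc = kernel_alloc   # hypothetical stand-in for a GEM/TTM allocation
        self._free = {}                     # size -> list of idle buffers
        self.kernel_allocs = 0              # how often we had to go to the "kernel"

    def alloc(self, size):
        idle = self._free.get(size)
        if idle:
            return idle.pop()               # reuse: no trip into the kernel
        self.kernel_allocs += 1
        return self._kernel_alloc(size)

    def release(self, buf):
        # Keep the buffer around instead of freeing it.
        self._free.setdefault(len(buf), []).append(buf)


cache = BufferCache(bytearray)
for _ in range(1000):                       # stand-in for the benchmark's main loop
    buf = cache.alloc(4096)
    cache.release(buf)
print(cache.kernel_allocs)
```

After the first iteration every allocation is served from the cache,
so the loop only reaches the underlying allocator once.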

>[...]
> this? I didn't spot any obvious mistake in mesa state tracker. Of
> course one could argue that it's the pipe driver which is slow but I
> don't think it's the only one to blame. Classic driver doesn't fall
> over in drawoverhead test, though classic drivers are a lot less
> performant on this benchmark so maybe the bottleneck in classic is
> also somewhere in state world.
>
I'm not going to take your r600c results literally because something
else was seriously slowing it down in this test. For comparison I've
repeated the same benchmark with the nouveau classic driver, and this
is what I get:

                     draw only   draw nop sc   draw sc    overhead
  nv17 (classic)     1600t       1500t         685.3t     0.000835 ms/draw
  nv17 (blob)        6100t       6100t         303.8t     0.003127 ms/draw

nouveau classic seems *less* affected by state changes than the nvidia
blob, so I wouldn't blame the fact that you're going through mesa
(instead of an alternative non-existent state tracker) for this.
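
For reference, the overhead column appears consistent with taking the
difference between the per-draw time with a state change and the
per-draw time without one. The sketch below assumes that the "t"
suffix means thousands of calls per second; the function name is
mine.

```python
def state_change_overhead_ms(draw_only_kcalls: float, state_change_kcalls: float) -> float:
    """Extra time per draw caused by the state change, in milliseconds.

    Rates are in thousands of calls per second, which is the same as
    calls per millisecond, so 1 / rate is already in ms per draw.
    """
    return 1.0 / state_change_kcalls - 1.0 / draw_only_kcalls


rows = {
    "nv17 (classic)": (1600.0, 685.3),   # (draw only, draw sc) from the table above
    "nv17 (blob)": (6100.0, 303.8),
}
for name, (draw_only, draw_sc) in rows.items():
    print(f"{name}: {state_change_overhead_ms(draw_only, draw_sc):.6f} ms/draw")
```

The results agree with the overhead column up to rounding of the
printed rates. The same formula applied to the quoted nobo figures
(1600t without and 173t with a state change) gives the 0.005155
ms/draw listed for r600g ngnpnb.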

>[...]
> Also, I think that we have been a bit naive to think that one can
> optimize GL stack and make it fast afterward (at least I have been
> naive :o)). Efficient/performant GL stack can only be done by
> carefully evaluating each step of the way and changing what needs to
> be changed no matter where in the stack. Which would have meant for
> us having an unstable kernel API until we are at a point where we see
> GL performing reasonably well (I envy nouveau people who are wiser on
> this front :o)).
>
They're only unstable in theory; the reality is that Linus will skin any
nouveau dev alive if he dares to break the kernel API. ;)

> Sorry for the long mail, but I wanted to explain the reasoning behind
> my findings. Maybe I am completely wrong and overlooked something; I
> hope not.
>
> Cheers,
> Jerome Glisse