[Mesa-dev] r600g/mesa/gallium performance, whois to blame ?
j.glisse at gmail.com
Fri Nov 12 20:32:41 PST 2010
On Fri, Nov 12, 2010 at 10:50 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Jerome Glisse <j.glisse at gmail.com> writes:
>> In order to find out which part of the stack is underperforming in
>> the face of state changes, I disabled layers one by one, starting from
>> the bottom (which is the only way to do this ;o)). First I disabled the
>> command buffer submission to the GPU (r600g-nogpu) while making sure the
>> driver still believed things were happening. Drawoverhead's state-change
>> rate went from 123t call/sec (r600g) to 220t call/sec (r600g-nogpu). So
>> the GPU slows things down a bit, but not that much; sysprof also shows
>> that we are spending a lot of time in the CS ioctl.
> In nouveau we also had a little performance problem with our pushbuf
> ioctl; larger command buffers helped a lot because they allowed
> userspace to pile up a considerable amount of rendering before coming
> back to kernel mode (this fix might be completely irrelevant to your
> case though: apparently the radeon CS ioctl is O(n) in the number of
> dwords submitted while its nouveau counterpart is O(1)).
Yes, sadly for us the kernel is our next bottleneck, but I don't think
we are hitting it yet.
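Francisco's point about larger command buffers can be illustrated with a toy model (names and numbers are made up): the per-ioctl cost is paid once per flush, so a bigger buffer means fewer kernel transitions for the same amount of rendering:

```c
#include <stddef.h>

struct pushbuf {
    size_t   size;    /* capacity in dwords */
    size_t   used;    /* dwords written since the last flush */
    unsigned flushes; /* stands in for the number of pushbuf ioctls */
};

static void pb_emit(struct pushbuf *pb, size_t ndw)
{
    if (pb->used + ndw > pb->size) { /* buffer full: flush to the kernel */
        pb->flushes++;
        pb->used = 0;
    }
    pb->used += ndw;
}
```

With a 4x larger buffer, the same stream of draws hits the kernel a quarter as often, which is the amortization being described.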
>> Next I disabled the r600g pipe driver, basically turning the driver
>> into a no-op where every call into it is ignored except for
>> buffer/resource/texture allocations. Drawoverhead's state-change rate
>> went from 220t call/sec (r600g-nogpu) to 1700t call/sec
>> (r600g-nogpu-nopipe). Obviously the r600g pipe driver is a CPU-intensive
>> layer, with a lot of register marshalling. But the most disturbing fact
>> is that we achieve 24.6 times fewer draw calls per second when there is
>> a state change than when there is none, which suggests the pipe driver
>> is likely not the only one to blame.
> Relative terms are somewhat misleading here; these are absolute
> overheads calculated from your results:
>
>   r600g ngnpnb   0.005155 ms/draw
>   r600g ngnp     0.013905 ms/draw
>   r600g ng       0.017194 ms/draw
>   r600g          0.021282 ms/draw
>   nv47g          0.006248 ms/draw
>
> So, yes, the pipe driver is definitely not the only one to be blamed,
> but at least 75% of the total overhead comes from below the mesa state
> tracker.
Yes, I described the r600g pipe issues in my reply to Marek.
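For reference, the absolute figures above can be reproduced from the reported rates: the state-change overhead is the difference between the per-call times with and without a state change ("t" meaning thousands of calls/sec). A small helper makes the arithmetic explicit; plugging in the nobo numbers (173t with a state change, 1600t without) gives back the 0.005155 ms/draw figure:

```c
static double overhead_ms(double sc_rate, double nosc_rate)
{
    /* extra time per draw caused by a state change, in milliseconds:
     * per-call time with a state change minus per-call time without */
    return (1.0 / sc_rate - 1.0 / nosc_rate) * 1000.0;
}
```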
>> Last, I checked whether our memory allocation through GEM/TTM was
>> hurting us. It does: drawoverhead runs at 1600t call/sec with no state
>> change (r600g-nogpu-nopipe-nobo) and 173t call/sec with a state change
>> (r600g-nogpu-nopipe-nobo). So when we use malloc for buffer allocation,
>> performance between no state change and a state change drops only by a
>> factor of 9.4. Obviously GPU buffer allocation is costing us a lot.
> The question is why GEM/TTM is costing you anything at all in this
> particular case: given working and smart enough buffer suballocation or
> caching, "drawoverhead" would never have met TTM inside its main loop.
According to sysprof, most of the overhead is in the pb_bufmgr_* helpers, IIRC.
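The kind of suballocation Francisco is alluding to can be sketched as a toy bump allocator (names and sizes are illustrative, not pb_bufmgr's actual design): small buffers are carved out of one big kernel BO up front, so the benchmark's inner loop never re-enters GEM/TTM:

```c
#include <stddef.h>

#define SLAB_SIZE (1u << 20) /* one 1 MiB kernel allocation up front */

struct slab {
    size_t   offset;        /* bump pointer into the big BO */
    unsigned kernel_allocs; /* how often we had to fall back to GEM/TTM */
};

/* Returns the offset of the suballocation within the slab's BO, or
 * (size_t)-1 when the slab is exhausted and a real kernel allocation
 * would be needed. */
static size_t slab_alloc(struct slab *s, size_t size, size_t align)
{
    size_t off = (s->offset + align - 1) & ~(align - 1);
    if (off + size > SLAB_SIZE) {
        s->kernel_allocs++; /* only here do we pay the ioctl price */
        return (size_t)-1;
    }
    s->offset = off + size;
    return off;
}
```

As long as the slab holds out, every allocation is a couple of integer operations in userspace and the kernel is never entered.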
>> this? I didn't spot any obvious mistake in the mesa state tracker. Of
>> course one could argue that it's the pipe driver which is slow, but I
>> don't think it's the only one to blame. The classic driver doesn't fall
>> over in the drawoverhead test, though classic drivers are a lot less
>> performant on this benchmark, so maybe the bottleneck in classic is
>> also somewhere in the state world.
> I'm not going to take your r600c results literally because something
> else was seriously slowing it down in this test. For comparison I've
> repeated the same benchmark with the nouveau classic driver; this is
> what I get:
>                  draw only   draw nop   sc draw   sc overhead
>  nv17 (classic)  1600t       1500t      685.3t    0.000835 ms/draw
>  nv17 (blob)     6100t       6100t      303.8t    0.003127 ms/draw
> nouveau classic seems *less* affected by state changes than the nvidia
> blob, so I wouldn't blame the fact that you're going through mesa
> (instead of an alternative non-existent state tracker) for this.
I think r600c is just a bit too naive, so changing any state with it
ends up being very expensive, but I haven't taken a closer look. I don't
think we should look too much at the relative cost of changing state. I
think fglrx optimized function call cost just enough that it didn't
impact performance, while nvidia went nuts and over-optimized function
call overhead. So I think the target should be making sure that core
mesa + gallium with a no-op pipe driver can keep up at 500t draw
call/sec when state changes occur (of course this could vary depending
on which states change), and not 173t call/sec.
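Put another way, the proposed target is a per-call budget (simple arithmetic, not a measurement): 500t draw calls/sec leaves two microseconds per state-changing draw for core mesa + gallium + a no-op pipe driver, versus roughly 5.8 microseconds at today's 173t:

```c
/* microseconds available per draw call at a given rate (calls/sec) */
static double budget_us(double calls_per_sec)
{
    return 1e6 / calls_per_sec;
}
```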