[Mesa-dev] r600g/mesa/gallium performance, who is to blame?

Jerome Glisse j.glisse at gmail.com
Fri Nov 12 17:51:13 PST 2010


On Fri, Nov 12, 2010 at 7:46 PM, Marek Olšák <maraeo at gmail.com> wrote:
> Hi Jerome,
>
> Concerning rewriting mesa etc. I am not a fan of complete rewriting. It's
> like throwing years of hard work and ideas and testing and fixing away.
> Sometimes this can get really crazy, resulting in endless rewrites with
> nothing being stable or usable for end users. I think I have seen this
> tendency in you. If you think there are any performance issues in the
> current code, please consider fixing them there, one by one.
>
> Another thing you can do for your testing is to measure the time spent in
> radeon_cs_emit and bo_wait, and compare that to the time spent elsewhere in
> the driver. bo_wait calls can sometimes be optimized away with smart
> buffer management (keep in mind that all write-only transfers can be
> implemented *without* bo_wait, but you first need resource_copy for pure
> buffers). Concerning radeon_cs_emit, it can be called in another thread so
> as not to cost you anything on multi-core machines (it's not so simple, some
> DRM calls may need to wait until radeon_cs_emit finishes, but you shouldn't
> need such a synchronization if vertices/pixels/commands flow only one way
> i.e. to the GPU).
>
> In r300g, we had the following issue. The driver used to spend too much time
> in pb_bufmgr_cache, especially in the create and map functions and in the
> winsys as well. It turned out the real problem was somewhere else entirely:
> we used u_upload_mgr somewhat naively and that was slowing down the driver a
> lot. However if you had had a look at the profiler results, you wouldn't
> have been able to see any obvious connection between u_upload_mgr and the
> winsys. Eventually the fix turned out to be pretty simple, but my point is
> that profiler data can show you the bottleneck, but not the real cause nor
> will it help you to find the best (or just a good) solution.
>
> Also I think demos/perf/* are quite bad tests if you care about performance
> in real apps. Running some replays from real games under sysprof or
> callgrind will give you more interesting results. Phoronix Test Suite comes
> in handy here as it contains a lot of automatic tests of real games you can
> tear apart and use. :) But not all of them are framerate-independent,
> therefore you cannot always expect callgrind to give you meaningful results.
>
> Marek
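
To make the "call radeon_cs_emit in another thread" idea above a bit
more concrete, here is a rough sketch of handing the submission to a
worker thread. The cs_queue type, queue_cs_for_emit() and the
radeon_cs_emit() prototype are assumptions for the illustration, not
the actual winsys code:

#include <pthread.h>
#include <stdbool.h>

struct radeon_cs;                                  /* opaque, from the winsys */
extern int radeon_cs_emit(struct radeon_cs *cs);   /* assumed prototype */

struct cs_queue {
    pthread_mutex_t   mutex;
    pthread_cond_t    cond;
    struct radeon_cs *pending;   /* one CS waiting to be emitted, or NULL */
    bool              stop;
};

/* Worker thread: pulls the finished CS off the queue and submits it to
 * the kernel so the rendering thread can start filling the next CS. */
static void *cs_submit_thread(void *arg)
{
    struct cs_queue *q = arg;

    pthread_mutex_lock(&q->mutex);
    while (!q->stop) {
        while (!q->pending && !q->stop)
            pthread_cond_wait(&q->cond, &q->mutex);

        if (q->pending) {
            struct radeon_cs *cs = q->pending;

            pthread_mutex_unlock(&q->mutex);
            radeon_cs_emit(cs);                 /* the expensive part */
            pthread_mutex_lock(&q->mutex);

            q->pending = NULL;
            pthread_cond_broadcast(&q->cond);   /* wake the producer */
        }
    }
    pthread_mutex_unlock(&q->mutex);
    return NULL;
}

/* Called by the driver where it used to call radeon_cs_emit() directly;
 * it only blocks if the previous CS has not been submitted yet. */
static void queue_cs_for_emit(struct cs_queue *q, struct radeon_cs *cs)
{
    pthread_mutex_lock(&q->mutex);
    while (q->pending)
        pthread_cond_wait(&q->cond, &q->mutex);
    q->pending = cs;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}

With only one CS in flight the rendering thread blocks at most until
the previous submission finishes, so on a multi-core machine most of
the kernel submission cost moves off the rendering thread.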

Maybe I should have stressed that I used demos/perf/* as a helper to
understand what was wrong with games (openarena, ut2004, nexuiz,
doomIII, ... are among those I looked at). Sysprof on games shows
that we spend a huge amount of time in r600_draw_vbo; the time is
split between r600_context_draw, the inefficient shader rebuild,
r600_upload_user_buffers and st_validate_state.

For r600_context_draw, besides the issue with buffer management
(gettimeofday and all the layers and hops we have to go through to
get the info), I don't see much room for improvement other than
cutting down the number of states we are processing at each draw.
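
To make the "cutting down the number of states we process per draw"
part concrete, here is a rough sketch of the usual dirty-flag pattern;
the state groups and emit functions are invented for the example, they
are not the actual r600g atoms:

#include <stdint.h>

/* One bit per group of state. */
enum {
    DIRTY_BLEND      = 1u << 0,
    DIRTY_RASTERIZER = 1u << 1,
    DIRTY_VIEWPORT   = 1u << 2,
    DIRTY_SHADER     = 1u << 3,
};

struct ctx {
    uint32_t dirty;   /* set by the pipe state-setter hooks */
};

static void emit_blend(struct ctx *c)      { (void)c; /* write blend regs */ }
static void emit_rasterizer(struct ctx *c) { (void)c; /* write rast regs */ }
static void emit_viewport(struct ctx *c)   { (void)c; /* write viewport regs */ }
static void emit_shader(struct ctx *c)     { (void)c; /* upload shader */ }

/* A state setter only records the change and flags the group as dirty. */
static void set_blend_state(struct ctx *c /*, const void *state */)
{
    c->dirty |= DIRTY_BLEND;
}

/* At draw time only the groups that changed since the last draw are
 * re-emitted, instead of walking every state on every draw call. */
static void draw_vbo(struct ctx *c)
{
    if (c->dirty & DIRTY_BLEND)      emit_blend(c);
    if (c->dirty & DIRTY_RASTERIZER) emit_rasterizer(c);
    if (c->dirty & DIRTY_VIEWPORT)   emit_viewport(c);
    if (c->dirty & DIRTY_SHADER)     emit_shader(c);
    c->dirty = 0;

    /* ... emit the actual draw packet ... */
}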

Obviously the shader issue is partly addressable by using a fetch
shader and, again, by fixing the bo allocation overhead.

r600_upload_user_buffers can be optimized, again by fixing some of
our pointless bo overhead (or maybe by fixing an issue similar to the
one you described for r300g).
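
The shape of that fix is roughly what u_upload_mgr already does:
suballocate the small user arrays from one big upload bo instead of
paying bo creation and map per draw. A rough sketch, with the bo
helpers invented for the example:

#include <stdint.h>
#include <string.h>

#define UPLOAD_BO_SIZE (1024 * 1024)

/* Invented stand-ins for the real winsys buffer API. */
struct bo;
extern struct bo *bo_create(unsigned size);
extern void      *bo_map(struct bo *bo);

struct uploader {
    struct bo *bo;       /* current shared upload buffer */
    uint8_t   *map;
    unsigned   offset;   /* next free byte in the buffer */
};

/* Copy a small user vertex array into the shared upload bo and return
 * the offset it landed at; a new bo is allocated only when the current
 * one is full, so the per-draw bo_create/bo_map cost mostly disappears. */
static unsigned upload(struct uploader *u, const void *data, unsigned size,
                       struct bo **out_bo)
{
    unsigned offset;

    if (!u->bo || u->offset + size > UPLOAD_BO_SIZE) {
        u->bo = bo_create(UPLOAD_BO_SIZE);
        u->map = bo_map(u->bo);
        u->offset = 0;
    }

    offset = u->offset;
    memcpy(u->map + offset, data, size);
    u->offset += size;

    *out_bo = u->bo;
    return offset;
}

Real code would also align the offsets and keep the previous bo
referenced until the GPU is done reading from it, but the basic shape
is that.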

Finally, st_validate_state can also be improved by fixing the bo
stuff, but I believe this also shows that we are processing more
state than the closed source driver does for the same rendering.

This is where drawoverhead & vertexrate provide insight. vertexrate
shows that gallium has no issue in terms of vertex throughput. Sadly,
drawoverhead shows that we are losing the battle any time there is a
state change. The difference between nogpu-nopipe & nogpu-nopipe-nobo
shows that our bo management code inside the pipe driver adds an
amazing amount of overhead and can likely be blamed for most of the
cost. Still, there is a limit to what you can optimize in the pipe
driver, and I believe being 28 times slower than fglrx on this
benchmark is a testimony to this. The numbers also show, I believe,
that mesa/gallium has a role in this bottleneck:
r600g-nogpu-nopipe-nobo goes from 1600t calls/sec to 173t calls/sec,
almost 10 times slower, while the closed source driver is only 3
times slower.
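
For reference, the core of what drawoverhead measures is roughly the
loop below: identical draws with a trivial state toggle in between
(simplified for illustration, not the exact demo code):

#include <GL/gl.h>
#include <sys/time.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* Time a batch of draws with a cheap state toggle between each one;
 * comparing the result against the same loop without the toggle shows
 * how much the driver pays per state change rather than per draw. */
static double calls_per_sec_with_state_change(int calls)
{
    double t0 = now_sec();
    int i;

    for (i = 0; i < calls; i++) {
        if (i & 1)
            glEnable(GL_SCISSOR_TEST);
        else
            glDisable(GL_SCISSOR_TEST);

        glDrawArrays(GL_TRIANGLES, 0, 3);
    }
    glFinish();

    return calls / (now_sec() - t0);
}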

The fact that the nouveau driver also has a significant drop points
toward inefficient state updates as well. Even more so for the
nouveau driver, as it has far better raw performance in the
no-state-change case than r600g or r300g (r600g is way too slow here:
2 times slower than r300g, which itself is 2 times slower than
nouveau, and I believe nouveau can be optimized too, so we are likely
severely underperforming in the pipe driver).

Also, as I said, I am well aware of the false impressions that
sysprof might give, and I stressed that part in my mail. I have tried
to be careful in what I am doing, and I believe my analysis is right:
vertex throughput is ok, but state changes lead to too much overhead
(and this overhead is amplified by the pipe driver's inefficiency).


As a note, the GLES2 idea was more about floating an idea to try to
get someone interested in such a pet project. I am totally aware of
all the years of work mesa is built on and I wouldn't want to lose
it.

Cheers,
Jerome Glisse

