[Mesa-dev] r600g/mesa/gallium performance, who is to blame?
Marek Olšák
maraeo at gmail.com
Fri Nov 12 16:46:56 PST 2010
Hi Jerome,
Concerning rewriting Mesa etc., I am not a fan of complete rewrites. It's
like throwing years of hard work, ideas, testing, and fixing away.
Sometimes this can get really crazy, resulting in endless rewrites with
nothing being stable or usable for end users. I think I have seen this
tendency in you. If you think there are any performance issues in the
current code, please consider fixing them there, one by one.
Another thing you can do for your testing is to measure the time spent in
radeon_cs_emit and bo_wait, and compare that to the time spent elsewhere in
the driver. bo_wait calls can sometimes be optimized away with smart
buffer management (keep in mind that all write-only transfers can be
implemented *without* bo_wait, but you first need resource_copy for pure
buffers). Concerning radeon_cs_emit, it can be called in another thread so
as not to cost you anything on multi-core machines (it's not that simple,
some DRM calls may need to wait until radeon_cs_emit finishes, but you
shouldn't need such synchronization if vertices/pixels/commands flow only
one way, i.e. to the GPU).
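
To illustrate the write-only point, here is a minimal sketch of the
renaming trick (the demo_* names are made up for illustration, they are
not our winsys API): on a write-only map of a busy buffer, give the
buffer fresh storage instead of waiting on it:

struct demo_bo;                             /* kernel buffer object */
struct demo_buffer {
    struct demo_bo *bo;
    unsigned size;
};

extern int  demo_bo_busy(struct demo_bo *bo);        /* GPU still using it? */
extern struct demo_bo *demo_bo_create(unsigned size);
extern void demo_bo_unreference(struct demo_bo *bo); /* freed once GPU done */
extern void *demo_bo_map(struct demo_bo *bo);        /* would stall if busy */

void *map_write_only(struct demo_buffer *buf)
{
    if (demo_bo_busy(buf->bo)) {
        /* The GPU still reads the old storage: give the buffer a new
         * BO and drop our reference to the old one.  The kernel keeps
         * the old BO alive until the GPU is done, so no bo_wait. */
        struct demo_bo *fresh = demo_bo_create(buf->size);
        demo_bo_unreference(buf->bo);
        buf->bo = fresh;
    }
    /* The BO mapped here is guaranteed idle, so this cannot stall. */
    return demo_bo_map(buf->bo);
}

With this, the only price of mapping a busy buffer is a fresh
allocation, which a buffer cache can make cheap.
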
In r300g, we had the following issue. The driver used to spend too much time
in pb_bufmgr_cache, especially in the create and map functions, and in the
winsys as well. It turned out the real problem was somewhere else entirely:
we used u_upload_mgr somewhat naively and that was slowing down the driver a
lot. However, if you had looked at the profiler results, you wouldn't
have been able to see any obvious connection between u_upload_mgr and the
winsys. Eventually the fix turned out to be pretty simple, but my point is
that profiler data can show you the bottleneck, but not the real cause, nor
will it help you find the best (or just a good) solution.
Also, I think demos/perf/* are quite bad tests if you care about performance
in real apps. Running some replays from real games under sysprof or
callgrind will give you more interesting results. The Phoronix Test Suite
comes in handy here, as it contains a lot of automated tests of real games
you can tear apart and use. :) But not all of them are framerate-independent,
so you cannot always expect callgrind to give you meaningful results.
Marek
On Fri, Nov 12, 2010 at 8:55 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
> Hi,
>
> I have been doing some benchmarking lately to try to identify the
> bottlenecks in the Mesa/Gallium/R600 driver. I fear the results are not
> the ones I expected. I would have liked the GPU to be the bottleneck, so
> that additions of new features such as texture tiling or hyper-z would
> immediately boost performance. I mostly used "old" GL applications with
> quake3-like rendering (openarena, cake, quake3, ut2004, nexuiz are
> among the programs I used); I am pretty sure that in the end most of them
> can reach close to maximum performance without the use of advanced GPU
> optimizations such as hyper-z, because newer hardware such as
> R600/R700/Evergreen should have enough raw power for that kind of
> rendering.
>
> I have used sysprof (you will find sysprof xml files in the archive
> mentioned below) to collect timing information on where the CPU is
> spending its time. To minimize GPU impact and avoid having to compare
> with pageflipping, games were run in windowed mode at 800x600. I would
> also like to point out that, given the nature of GL and the layering
> of our implementation of it, one should be prudent with sysprof
> results. If the top of the stack is inefficient and calls several
> times into the hw driver, the hw driver functions might turn up as the
> top offenders, but this doesn't necessarily mean that they are the ones
> to blame (though they share responsibilities ;)).
>
> (b = billion, m = million, t = thousand)
> (ng = no gpu, i.e. no commands submitted to the gpu; np = no pipe, i.e.
> pipe driver calls turn into no-ops; nb = no kernel bo management, i.e.
> bos replaced by malloc; nv47b = nv47 nvidia blob; nv47g = nouveau
> gallium driver)
>
> I used the same configuration (with an HD3650, GTS7800 or X1950 radeon,
> with mesa from nov 11). You can download most of the sysprof results here
> (along with a ddx patch to remove throttling and a mesa patch showing
> how I commented out the pipe driver part):
>
>
>
> * vertexrate --------------------------------------------------------
>
> verts/second     immediate   glDrawArrays   VBO_glDrawArrays
> r600g ngnpnb         16.2m       11700.0m           16600.0m
> r600g ngnp           33.5m       11700.0m           17400.0m
> r600g ng             29.2m          58.3m            2200.0m
> r600f                33.5m          95.1m             240.4m
> r600g                29.3m         117.3m             203.8m
> r600c                17.2m          42.2m             121.9m
> r500g                17.5m          31.0m             141.5m
> r500c                14.4m          13.5m              12.5m
> r500c ng             20.2m          57.5m            1700.0m
> nv47g                14.5m          55.5m             118.7m
> nv47b                65.8m          68.7m             201.2m
>
> demos/perf/vertexrate shows that the r600g driver is close to fglrx
> (sometimes a little bit faster, sometimes a little bit slower). Other
> tests I have done make me confident that the vertex path is not the
> bottleneck in our pipeline (though it could still probably be
> optimized further). Also, it seems that we are suffering from call
> overhead (likely TLS or other similar optimizations in our GL
> dispatch code); nvidia is a lot better at handling millions of calls.
>
> Also, it seems quite clear that the gallium driver outperforms the
> classic driver in the DrawArrays case. So from a pure vertex rate point
> of view, gallium is better.
>
>
> * swapbuffer --------------------------------------------------------
>
> in pixel/second   320x240(s)  320x240(scd)  1200x1024(s)  1200x1024(scd)
> r600g ngnpnb            797m          647m         1100m            730m
> r600g ngnp              795m          651m         1100m            729m
> r600g ng                812m          470m         1100m            729m
> r600f                  1300m          678m         1500m            980m
> r600g                   730m          324m         1100m            365m
> r600c                   654m          269m          728m            363m
> r500g                  1100m          535m         2800m           1500m
> r500c                   954m          310m         1800m           1400m
> r500c ng               1100m          367m         1800m           1800m
> nv47g                  1100m           30m         5200m            394m
> nv47b                  2300m         2000m         3700m           2600m
> (s = swap, scd = swap/clear/draw)
>
> demos/perf/swapbuffers pretty much shows that even if dri2 is less
> efficient at performing buffer copies/swaps than fglrx, it's still not
> the thing that is slowing us down.
>
> * drawoverhead ------------------------------------------------------
>
> draw call/second   draw only   draw nop sc   draw sc
> r600g ngnpnb           1600t         1500t      173t
> r600g ngnp             1700t         1600t       69t
> r600g ng                220t          220t       46t
> r600f                  4100t         3500t     1300t
> r600g                   123t          122t       34t
> r600c                    73t           71t       60t
> r500g                   235t          236t       80t
> r500c                   115t          115t       97t
> r500c ng                171t          171t      135t
> nv47g                   496t          471t      121t
> nv47b                 10600t         9300t     1200t
> (nop sc = no-op state change, i.e. state changed but to the same value;
> sc = state changed to a different value between each draw call)
>
>
> demos/perf/drawoverhead has the most interesting numbers: r600g goes
> from 123t(call/sec) when no state changes between draw calls to
> 34t(call/sec) when a state changes between draw calls (the fact that we
> only draw 4 vertices at each call is not that important: if we trust
> vertexrate, we are not impacted by the number of vertices we draw). So
> a state change divides the raw performance of our stack by 3.6. I
> wanted to know who was to blame for this.
>
> In order to find out which part of the stack is underperforming in
> the face of state changes, I slowly disabled layers starting from the
> bottom (which is the only way to do this ;o)). Thus I disabled the
> command buffer submission to the GPU (r600g-nogpu) and made sure the
> driver still believed things were happening. Drawoverhead state change
> goes from 123t(call/sec-r600g) to 220t(call/sec-r600g-nogpu). So the
> GPU is slowing things down a bit, but not that much; also, comparing
> sysprofs shows that we are spending a lot of time in the cs ioctl.
>
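> (A minimal sketch of the idea, not the actual patch from the archive
> above; the RADEON_NOOP_SUBMIT switch and the struct layout are made up
> for illustration:)
>
> #include <stdlib.h>   /* getenv */
>
> struct radeon_cs { unsigned cdw; /* command words queued, plus ... */ };
>
> extern int radeon_cs_emit_real(struct radeon_cs *cs); /* the cs ioctl path */
>
> /* Short-circuit submission but keep the rest of the winsys alive, so
>  * the driver still believes its commands were submitted. */
> int radeon_cs_emit(struct radeon_cs *cs)
> {
>     if (getenv("RADEON_NOOP_SUBMIT")) {   /* made-up switch */
>         cs->cdw = 0;   /* pretend the command words were consumed */
>         return 0;      /* report success to the caller */
>     }
>     return radeon_cs_emit_real(cs);
> }
>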
> Next was to disable the r600g pipe driver, basically turning the
> driver into a no-op where each call into it is ignored except for
> buffer/resource/texture allocations. Drawoverhead state change goes from
> 220t(call/sec-r600g-nogpu) to 1700t(call/sec-r600g-nogpu-nopipe).
> Obviously the r600g pipe is a CPU-intensive task, with a lot of register
> marshalling. But the most disturbing fact is that we achieve 24.6
> times fewer draw calls per second when there is a state change than when
> there is none. This points out that the pipe driver is likely not the
> only one to blame.
>
> Last was to see if our memory allocation through gem/ttm was hurting
> us. Yes it does (drawoverhead no state change:
> 1600t(call/sec-r600g-nogpu-nopipe-nobo); drawoverhead state change:
> 173t(call/sec-r600g-nogpu-nopipe-nobo)). So when we use malloc for
> buffer allocation, the performance, between no state change and a
> state change, drops only by a factor of 9.4. So obviously GPU buffer
> allocation is costing us a lot.
>
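> (Again a sketch of the idea rather than the actual patch: the "nobo"
> runs replace the kernel bo with plain malloc, so only cpu-side
> bookkeeping remains; the demo_* names are made up:)
>
> #include <stdlib.h>
>
> struct demo_bo { void *cpu_ptr; unsigned size; };
>
> /* "nobo" mode: no gem object and no kernel round trip, just memory
>  * the driver can write into.  Nothing reaches the GPU this way, which
>  * is fine here since command submission is already disabled. */
> struct demo_bo *demo_bo_create(unsigned size)
> {
>     struct demo_bo *bo = malloc(sizeof(*bo));
>     bo->cpu_ptr = malloc(size);
>     bo->size = size;
>     return bo;
> }
>
> void demo_bo_destroy(struct demo_bo *bo)
> {
>     free(bo->cpu_ptr);
>     free(bo);
> }
>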
> sysprof shows that it's on the shader constants where we are losing most
> of our bo cpu time. In that path it's the pb_bufmgr_cache mechanism
> that is to blame; we are losing a lot of time in gettimeofday. We are
> also losing a lot of time in the pb_bufmgr_cache bo allocation path
> (again because of gettimeofday). These show up in the nexuiz sysprofs
> too, but they look less offensive there, as nexuiz's usage pattern also
> suffers from bottlenecks in other parts of the code.
>
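> (A minimal sketch of what I mean, with a made-up cache structure:
> call gettimeofday once per cs flush and stamp buffers with the cached
> value, instead of paying for a syscall on every bo operation:)
>
> #include <sys/time.h>
>
> struct demo_cache {
>     struct timeval now;   /* refreshed once per flush, not per bo */
> };
>
> /* Call this once per cs flush (or once every N cache operations). */
> void demo_cache_tick(struct demo_cache *cache)
> {
>     gettimeofday(&cache->now, NULL);
> }
>
> /* Stamp a released bo with the cached time; no syscall in this path. */
> void demo_cache_stamp(struct demo_cache *cache, struct timeval *stamp)
> {
>     *stamp = cache->now;
> }
>
> The eviction logic can then compare stamps against cache->now with
> plain arithmetic, which should make gettimeofday disappear from the
> profiles.
>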
> The drawoverhead outcome is, I believe, that gallium is severely
> underperforming in the face of state changes. If I had to guess, I would
> say that an improvement of a factor of n in this would give an
> improvement of ~n overall (at least for r600g, and likely for r300g
> too). Does anyone work on this? Or does anyone know what could be done
> to improve this? I didn't spot any obvious mistake in the mesa state
> tracker. Of course one could argue that it's the pipe driver which is
> slow, but I don't think it's the only one to blame. The classic driver
> doesn't fall over in the drawoverhead test, though classic drivers are a
> lot less performant on this benchmark, so maybe the bottleneck in
> classic is also somewhere in the state world.
>
> To me, gallium needs to be improved to be more efficient at changing
> only the smallest number of states and to try to avoid calls into the
> pipe driver. I am not saying there is nothing to be done in the pipe
> driver, but there is a limit on what we can do there, and a gpu driver
> is all about avoiding doing things :o)
>
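> (A minimal sketch of the kind of filtering I mean, with made-up
> demo_* names; the existing cso module does something like this for a
> few state objects already:)
>
> #include <string.h>
>
> struct pipe_context;                      /* opaque */
>
> struct demo_blend_state { unsigned eqn, src_factor, dst_factor; };
>
> extern void *demo_create_blend_cso(struct pipe_context *pipe,
>                                    const struct demo_blend_state *state);
> extern void demo_bind_blend_cso(struct pipe_context *pipe, void *cso);
>
> struct demo_context {
>     struct pipe_context *pipe;
>     struct demo_blend_state last_blend;   /* last state seen */
> };
>
> void demo_set_blend(struct demo_context *ctx,
>                     const struct demo_blend_state *state)
> {
>     /* Same value as last time: skip the pipe driver entirely.  This
>      * is the "nop sc" column from drawoverhead. */
>     if (memcmp(&ctx->last_blend, state, sizeof(*state)) == 0)
>         return;
>     ctx->last_blend = *state;
>
>     /* A real change: create (or better, look up in a hash of already
>      * created state objects) and bind. */
>     demo_bind_blend_cso(ctx->pipe, demo_create_blend_cso(ctx->pipe, state));
> }
>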
> Also, the nvidia nouveau driver shows the same kind of issue on
> drawoverhead. I would also like to stress that I am not interested in
> making drawoverhead superfast, but I believe drawoverhead exhibits
> one of the biggest shortcomings of today's gallium implementations.
>
> This made me wonder if it would not be a good time to start thinking
> about doing a new pure GL2 state tracker for gallium (an idea I stole
> from Stephane ;o)).
> Drawbacks:
>  - losing/missing fixes/GL spec interpretations grown into mesa over
>    all the years
>  - Intel is still on classic, GL is big.
> Advantages:
>  - implemented with efficiency straight from the beginning
>  - less code than mesa?
>  - faster gallium.
>
> To mitigate the drawbacks, one might start with EGL2 only, which should
> be a lot simpler to achieve while still allowing testing of some real
> apps (like the quake3 or doomIII rendering engines).
>
> * bo allocation -----------------------------------------------------
>
> Another constant pattern I see in all the benchmarks I have done is that
> a lot of bo allocation/destruction happens during draw sequences (for
> shaders & shader constants mostly, but also for temporary vertex
> buffers). The current solutions we have are underperforming (see the
> pb_bufmgr* bottlenecks I talked about above).
>
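> (A minimal sketch, with made-up demo_* names, of the kind of cheap
> reuse I would like for those short-lived bos: a per-size free list
> checked before hitting the kernel, with no syscall in the fast path:)
>
> struct demo_bo { struct demo_bo *next; unsigned size; };
>
> struct demo_slab {
>     struct demo_bo *free_list;   /* idle bos, all of bo_size bytes */
>     unsigned bo_size;
> };
>
> extern struct demo_bo *demo_bo_create_kernel(unsigned size); /* gem ioctl */
>
> struct demo_bo *demo_slab_alloc(struct demo_slab *slab)
> {
>     struct demo_bo *bo = slab->free_list;
>     if (bo) {                        /* fast path: no kernel call */
>         slab->free_list = bo->next;
>         return bo;
>     }
>     return demo_bo_create_kernel(slab->bo_size);
> }
>
> void demo_slab_release(struct demo_slab *slab, struct demo_bo *bo)
> {
>     /* Caller must guarantee the GPU is done with the bo, e.g. by
>      * keeping it on a per-cs busy list and moving it here only after
>      * the cs fence has passed. */
>     bo->next = slab->free_list;
>     slab->free_list = bo;
> }
>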
> Another point is how we handle all those bos; it seems that our current
> command submission mechanism is severely slow. If you look at the
> sysprofs you will see that, despite us being 3 times slower, we are
> spending the same amount of time in kernel space as the closed source
> driver; this surely can't be good.
>
> To address this I want to play a bit with ttm. For instance, try
> considering every bo as pinned and use a big gtt space (simplifying away
> bo validation/allocation so it is lockless, or at least very quick).
> This would help to determine the cost (if any) of our current scheme,
> where we always try to satisfy the userspace request, which is often to
> have bos in vram. I might also try to change the bo mapping by always
> using pages and adding a call to sync the vram copy (such changes
> obviously need hacks in userspace too, otherwise they won't work
> properly).
>
> Also, I think we have been a bit naive to think that one can build a GL
> stack and optimize it to be fast afterwards (at least I have been
> naive :o)). An efficient/performant GL stack can only be done by
> carefully evaluating each step of the way and changing what needs to
> be changed, no matter where in the stack. That would have meant for us
> having an unstable kernel API until we are at a point where we see GL
> performing reasonably well (I envy the nouveau people, who are wiser on
> this front :o)).
>
> Sorry for the long mail, but I wanted to explain the reasoning behind
> my findings. Maybe I am completely wrong and have overlooked something;
> I hope not.
>
> Cheers,
> Jerome Glisse