[Mesa-dev] Path to optimize (moving from create/bind/delete paradigm to set only?)
Jerome Glisse
j.glisse at gmail.com
Tue Nov 16 13:15:41 PST 2010
On Tue, Nov 16, 2010 at 3:27 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 16.11.2010 20:59, Jerome Glisse wrote:
>> On Tue, Nov 16, 2010 at 2:38 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> On 16.11.2010 20:21, Jerome Glisse wrote:
>>>> Hi,
>>>>
>>>> So I looked a bit more at which paths we should try to optimize in the
>>>> mesa/gallium/pipe infrastructure. Here are some numbers gathered from
>>>> games:
>>>>                     draw calls per call to:
>>>>              ps constant  vs constant  ps sampler  vs sampler
>>>> doom3             1.45        1.39        9.24        9.86
>>>> nexuiz            6.27        5.98        6.84        7.30
>>>> openarena      2805.64        1.38        1.51        1.54
>>>>
>>>> (a value of 1 means there is a call to this function for every draw
>>>> call, while a value of 10 means there is a call to this function every
>>>> 10 draw calls, on average)
>>>>
>>>> Note that the openarena ps constant number is understandable, as it's
>>>> the fixed-function GL pipeline which is in use there, and the pixel
>>>> shader constants don't need much change in that case.
>>>>
>>>> So I think the clear trend is that there is a lot of constant uploading
>>>> and sampler changing (almost at each draw call for some games). Thus I
>>>> think we want to make sure that we have a real fast path for uploading
>>>> constants and changing samplers. I think those paths should change and
>>>> should avoid using some of the gallium infrastructure. For shader
>>>> constants I think the best solution is to provide the pointer to the
>>>> program constant buffer directly to the pipe driver and let the driver
>>>> choose how it wants to upload the constants to the GPU (GPUs have
>>>> different capabilities: some can stream constant buffers inside their
>>>> command stream, others can just keep around a pool of buffers into
>>>> which they can memcpy, ...). As there is no common denominator, I don't
>>>> think we should go through the pipe buffer allocation and provide a new
>>>> pipe buffer each time.
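>>>>
>>>> Roughly, the shape I have in mind is something like the sketch below
>>>> (names and details here are only illustrative, not a real interface):
>>>>
>>>> #include <string.h>
>>>>
>>>> /* Hypothetical driver context with a simplified command stream. */
>>>> struct example_context {
>>>>    unsigned char cs[16384];
>>>>    unsigned      cs_used;
>>>> };
>>>>
>>>> /* The state tracker hands the raw constant data to the driver and the
>>>>  * driver decides how to upload it; this driver just streams it into
>>>>  * its command buffer, with no pipe buffer allocation per draw. */
>>>> static void example_set_constant_data(struct example_context *ctx,
>>>>                                       unsigned shader_stage,
>>>>                                       const void *data,
>>>>                                       unsigned size)
>>>> {
>>>>    (void)shader_stage;
>>>>    if (ctx->cs_used + size <= sizeof(ctx->cs)) {
>>>>       memcpy(ctx->cs + ctx->cs_used, data, size);
>>>>       ctx->cs_used += size;
>>>>    }
>>>> }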
>>>>
>>>> Optimizing this for r600g allows a ~7% increase in games (when draw is
>>>> a nop), ~5% (when not submitting to the GPU), and ~3% when no part of
>>>> the driver is commented out. r600g has other bottlenecks that tend to
>>>> minimize the gain we can get from such an optimization. Patch at
>>>> http://people.freedesktop.org/~glisse/gallium_const_path/
>>>>
>>>> For samplers I don't think we want to create persistent objects; we are
>>>> spending precious time building, hashing, and searching for a similar
>>>> sampler each time in the gallium code. I think it would be best to
>>>> treat the state as use-once and forget. That said, we can provide
>>>> helper functions to pipe drivers that want to cache samplers (but even
>>>> for virtual hw I don't think this makes sense). I haven't yet
>>>> implemented a fast path for samplers to see how much we can win from
>>>> that, but I will report back once I do.
>>>>
>>>> So a more fundamental question here is: should we move away from
>>>> persistent state and consider all states (except shaders and textures)
>>>> as too volatile for caching any of them to make sense from a
>>>> performance point of view? That would mean changing a lot of the
>>>> create/bind/delete interfaces into simple set interfaces for the pipe
>>>> driver. This could be seen as a simplification. Anyway, I think we
>>>> should really consider moving more toward set than create/bind/delete
>>>> (I loved the create/bind/delete paradigm a lot, but it doesn't seem to
>>>> be the one you want with GL, at least from the numbers I gathered with
>>>> some games).
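>>>>
>>>> To make the comparison concrete, here is roughly what the two styles
>>>> look like for a sampler-like state (purely illustrative declarations,
>>>> not the real gallium ones):
>>>>
>>>> struct ex_context;
>>>> struct ex_sampler_state {
>>>>    unsigned wrap_s, wrap_t, min_filter, mag_filter;
>>>> };
>>>>
>>>> /* Today: persistent objects, three hooks per state type. */
>>>> struct ex_cso_funcs {
>>>>    void *(*create_sampler_state)(struct ex_context *,
>>>>                                  const struct ex_sampler_state *);
>>>>    void  (*bind_sampler_states)(struct ex_context *, unsigned count,
>>>>                                 void **cso);
>>>>    void  (*delete_sampler_state)(struct ex_context *, void *cso);
>>>> };
>>>>
>>>> /* Proposed: a single set hook, state is use-once and forgotten.  The
>>>>  * driver translates it to registers right away (or caches it
>>>>  * internally if that pays off on its hardware); gallium itself does
>>>>  * no hashing, no lookup, and keeps no persistent object around. */
>>>> struct ex_set_funcs {
>>>>    void (*set_sampler_states)(struct ex_context *, unsigned count,
>>>>                               const struct ex_sampler_state *states);
>>>> };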
>>> Why do you think it's faster to create and use a new state rather than
>>> search the hash cache and reuse one? I was under the impression
>>> (this being a dx10 paradigm) that even hw is quite optimized for this
>>> (that is, you just keep all the state objects on the hw somewhere and
>>> switch between them). Also, which functions did you really see? If
>>> things work as expected, it should be mostly bind, not create/delete.
>>> Now it is certainly possible a driver doesn't make good use of this
>>> (i.e. it really does all the time-consuming stuff on bind), but this is
>>> outside the scope of the infrastructure.
>>> It is possible the hashing is insufficient (it could for instance cause
>>> too many collisions, hence the need to recreate state objects), but the
>>> mechanism looks quite sound to me in principle.
>>>
>>> Roland
>>>
>>
>> The create/bind & reuse paradigm is likely good for a DirectX-like API,
>> where the API puts the incentive on the application to create and manage
>> efficiently the states it wants to use. But GL, which is I believe the
>> API we should focus on, is a completely different business. From what
>> I am seeing in games, we repeatedly see changes to shader constants and
>> we repeatedly see changes to samplers. We might be using too small a
>> hash or missing opportunities for reuse; I can totally believe that. But
>> nonetheless, from what I see, it's counterproductive to try to hash all
>> those states and hope for reuse, simply because the cost of creating a
>> state is too high and the reuse opportunity (even if we improve it)
>> looks too small. Here you have to think about hundreds of thousands of
>> calls per frame, and wasting time trying to find a GL state pattern in
>> the application looks doomed to failure to me. From what I have seen,
>> the closed source driver directly translates GL state into GPU register
>> values and directly submits whole registers to the GPU; I guess GPUs are
>> good at changing state, so it's useless to burn CPU to try to save GPU
>> time.
>>
>> So yes, it seems there is no reuse, and no, I don't think we can fix
>> that or that it's even worth trying.
> I'm really not sure of that. I think one reason dx10 did this is because
> applications mostly did that anyway - at least for subsequent frames
> pretty much all the state they are using is going to be the same as
> used on the previous frame. This is independent of the apps using d3d9,
> d3d10, or some GL version - d3d10 just made it explicit, while with the
> other APIs you have to resubmit all individual state changes.
> Now, there certainly is some overhead for trying to find some previously
> used state (which is of course not necessary with d3d10). But it should
> be small - provided it actually works. If you don't see reuse, maybe the
> hashing stuff needs improvements (a larger hash table being the obvious
> one, but not the only one).
> I still don't see why hashing and reusing state would be more expensive
> than recalculating hw state. Also, this gives you the opportunity to
> store all the precalculated state on the hw, without the need to
> retransmit (you just need to point the hw to pick up the state you
> already have) - which should also make things faster.
> Hashing should be cheap.
>
> Roland
>
There is also a very important variable here: the different GPUs don't
pull the gallium states into the same registers, and that forbids state
caching. There is no way you can find a common denominator, because none
of us controls where the next generation of GPUs' states will end up. So
caching state is just not doable in gallium from the pipe driver's point
of view (it's not all black and white, but the set of states that you
can properly cache on one GPU is different on another GPU).
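
To illustrate what I mean (register names and layouts are made up for
the example): the same gallium-level state can end up packed completely
differently depending on the ASIC, so there is no single pre-baked blob
that gallium could cache and hand to every driver:

/* Made-up depth/alpha-like state and two made-up ASICs. */
struct ex_dsa_state {
   unsigned depth_func, alpha_func, alpha_ref;
};

/* ASIC A: depth func and alpha test share one control register. */
static unsigned asic_a_pack_db_control(const struct ex_dsa_state *s)
{
   return (s->depth_func << 0) | (s->alpha_func << 4) | (s->alpha_ref << 8);
}

/* ASIC B: the same gallium state is split across two unrelated
 * registers, with different bit positions. */
static void asic_b_pack(const struct ex_dsa_state *s,
                        unsigned *reg_depth, unsigned *reg_alpha)
{
   *reg_depth = s->depth_func << 2;
   *reg_alpha = (s->alpha_func << 0) | (s->alpha_ref << 16);
}
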
For r3xx/r5xx/r6xx/r7xx/evergreen we have a bunch of registers that have
fields which depend on several different gallium states. In r600g I
ended up ORing and masking a lot, but it turns out this is the wrong
idea. I have been banging my head against every hard surface I could
find lately to try to come up with a way to fix the r600g overhead (I am
ignoring the shader overhead, which is more easily fixable with a lot of
typing), and so far my brain has come up empty; there are simply too
many registers that change, and whatever I do I can't make it any faster
(I can only win a couple of %). I am in the process of gathering data on
this, but I fear that the design of r600g is just not the right one (I
am kind of a naive person, so I always think the world will end up being
marvellous, but I am often disappointed ;))
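
As a made-up example of the kind of register I mean (the field layout is
invented, but the shape of the problem is the same): one register whose
fields come from several gallium states, so whichever state changes, the
whole register has to be rebuilt by masking and ORing:

/* Fields of one hypothetical register, each owned by a different
 * gallium state. */
#define EX_COLOR_CONTROL_BLEND_MASK   0x000000ffu  /* from blend state */
#define EX_COLOR_CONTROL_ROP_MASK     0x0000ff00u  /* from rasterizer state */
#define EX_COLOR_CONTROL_TARGET_MASK  0x00ff0000u  /* from framebuffer state */

struct ex_hw_state {
   unsigned color_control;
};

/* Each gallium-level change only owns a slice of the register ... */
static void ex_update_blend(struct ex_hw_state *hw, unsigned blend_bits)
{
   hw->color_control &= ~EX_COLOR_CONTROL_BLEND_MASK;
   hw->color_control |= blend_bits & EX_COLOR_CONTROL_BLEND_MASK;
}

/* ... so no single persistent CSO can hold the final register value. */
static void ex_update_rasterizer(struct ex_hw_state *hw, unsigned rop_bits)
{
   hw->color_control &= ~EX_COLOR_CONTROL_ROP_MASK;
   hw->color_control |= rop_bits & EX_COLOR_CONTROL_ROP_MASK;
}
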
My other point is that we are talking about thousands of calls per
frame, and all the small overheads add up very quickly here. When I
look at sysprof I see tons of 0.2% here and there, and it is the sum
of all of these that makes it slower than it could be. So no matter
how much I love the create-once-and-reuse idea (I do love it, I swear
:)), it's just not a good one if you don't have the application doing
the reuse for you and you have to scan for reuse patterns and thus burn
additional CPU time on it (once again, I agree hash + search should be
cheap; it's just that we have to do it way too much).
Cheers,
Jerome