[Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

Fri Aug 9 06:48:03 PDT 2013

On 09.08.2013 15:29, Marek Olšák wrote:
> On Fri, Aug 9, 2013 at 10:34 AM, Christian König
> <deathsimple at vodafone.de> wrote:
>> On 08.08.2013 21:38, Alex Deucher wrote:
>>
>>> On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>> On Thu, Aug 8, 2013 at 6:57 PM, Christian König <deathsimple at vodafone.de>
>>>> wrote:
>>>>> On 08.08.2013 16:33, Marek Olšák wrote:
>>>>>> On Thu, Aug 8, 2013 at 3:09 PM, Christian König
>>>>>> <deathsimple at vodafone.de>
>>>>>> wrote:
>>>>>>> On 08.08.2013 14:38, Marek Olšák wrote:
>>>>>>>
>>>>>>>> On Thu, Aug 8, 2013 at 9:47 AM, Christian König
>>>>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>>>> On 08.08.2013 02:20, Marek Olšák wrote:
>>>>>>>>>
>>>>>>>>>> FMASK is bound as a separate texture. For every texture, there can
>>>>>>>>>> be an FMASK. Therefore a separate array of resource slots has to be
>>>>>>>>>> added.
>>>>>>>>>>
>>>>>>>>>> This adds a new mechanism for emitting resource descriptors, its
>>>>>>>>>> features are:
>>>>>>>>>> - resource descriptors are stored in an ordinary buffer (not in a CS)
>>>>>>>>>
>>>>>>>>> Having resource descriptors outside of the CS has two problems that
>>>>>>>>> we need to solve first:
>>>>>>>>>
>>>>>>>>> 1. Fine-grained descriptor updates don't work; I already tried that.
>>>>>>>>> The problem is that, unlike on previous ASICs, descriptors are now a
>>>>>>>>> memory block and so no longer part of the CP context. So when we (for
>>>>>>>>> example) have a draw command executing and the next draw command uses
>>>>>>>>> new resources for a specific slot, we would either block until the
>>>>>>>>> first draw command is finished (which is bad for performance) or
>>>>>>>>> change the descriptors while they are still in use (which results in
>>>>>>>>> VM faults).
>>>>>>>> So what would the proper solution be here? Do I need to flush some
>>>>>>>> caches or would moving the descriptor updates to the constant IB fix
>>>>>>>> that?
>>>>>>>
>>>>>>> Actually the current implementation worked better than anything else I
>>>>>>> tried.
>>>>>>>
>>>>>>> When you really need the resource descriptors in a separate buffer, you
>>>>>>> need to use one buffer for each draw call and always write the full
>>>>>>> buffer contents (no partial updates). Flushing anything won't really
>>>>>>> help either.
>>>>>>>
>>>>>>> The only solution I see using one buffer is to block until the last
>>>>>>> draw call is finished with WAIT_REG_MEM, but that would be quite
>>>>>>> disastrous for performance.
>>>>>>>
>>>>>>>
>>>>>>>>> 2. If my understanding is correct, when they are embedded the
>>>>>>>>> descriptors are preloaded into the caches while executing the IB, so
>>>>>>>>> to achieve the same speed with descriptors outside of the IB you need
>>>>>>>>> to add additional commands to the constant IB, which is new to SI and
>>>>>>>>> which we currently don't support in the CS interface.
>>>>>>>> There seems to be support for the constant IB. The CS ioctl chunk ID
>>>>>>>> is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
>>>>>>>> si_vm_packet3_ce_check. Is there anything missing?
>>>>>>>
>>>>>>> The userspace side seems to be missing, and except for throwing NOP
>>>>>>> packets into it we never tested it. I know from the closed source side
>>>>>>> that it actually was quite tricky for them to get working.
>>>>>>>
>>>>>>> In addition, please note that I'm not 100% sure that just putting the
>>>>>>> descriptors into the IB is really helping here. It was just the
>>>>>>> simplest solution to avoid allocating a new buffer on each draw call.
>>>>>> I understand. I don't really need to have resource descriptors in a
>>>>>> separate buffer, all I need is these 3 basic features a gallium driver
>>>>>> should support:
>>>>>> - fine-grained resource updates (mainly for performance, see below)
>>>>>> - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
>>>>>> - no GPU crash if a shader is using SAMPLER[15] but there are no
>>>>>>   samplers bound
>>>>>>
>>>>>> FYI, partial sampler view and sampler state updates are coming to
>>>>>> gallium, Brian Paul already has some patches, it's just a matter of
>>>>>> time now. Vertex and constant buffer states already support partial
>>>>>> updates.
>>>>>
>>>>> That shouldn't be too much of a problem.
>>>>>
>>>>> Just allocate a state at startup and initialize it with the proper PM4
>>>>> commands for 16 samplers, then update the resource descriptors in that
>>>>> state when we change the bound textures/samplers/views/constants/whatever.
>>>>> All we need to do then is set the emitted state to NULL so that it gets
>>>>> re-emitted in the next draw command.
>>>> That would re-emit all 16 shader resources even if just one of them
>>>> needs to be changed. I was trying to avoid this inefficiency. Is it
>>>> really impossible to emit just one resource descriptor and keep the
>>>> others unchanged? This is a basic D3D10/11 feature, for example:
>>>>
>>>> void ID3D11DeviceContext::VSSetShaderResources(
>>>>     [in]  UINT StartSlot,
>>>>     [in]  UINT NumViews,
>>>>     [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
>>>> );
>>>>
>>>> If the constant engine is required to implement this interface
>>>> efficiently, then I'd like to work on constant IB support.
>>> You'll need to either store them in memory or re-emit them if you
>>> store them in the IB.  The CE is mainly there so that it can prime the
>>> TC in parallel with the command stream processing.
>>
>> Yeah indeed. The CE is just for prefetching everything into caches and
>> doesn't really help here.
>>
>> The only two options I see are either fully emitting it into the command
>> stream whenever anything changes or allocating a new buffer for the
>> resources on each new draw call, copying over the old state and then setting
>> just the things that changed. Both options have their pros and cons; no idea
>> what might be better.
>>
>> Fact is the resource descriptors are not allowed to change as long as the
>> shaders are running.
> I think flushing the TC before changing the descriptors should help.
> If not, then PS_PARTIAL_FLUSH or some other equivalent of WAIT_UNTIL
> should.

Huh? Why?

The scalar engine of the shaders loads the resource descriptors from
memory, so as long as the shaders are running we can't modify the
descriptors. So the only options here are either to keep multiple versions
of the descriptors around at the same time or to wait for the shaders to
finish before modifying the descriptors, which means draining the shader
pipeline quite a bit.
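
To make the "multiple versions" option concrete, here is a minimal sketch in
C. It is not radeonsi code: every name in it (desc_table, desc_ring,
desc_ring_update, desc_ring_draw) is hypothetical, the 8-dword descriptor
size and the ring depth of 4 are assumptions, and GPU progress is only
simulated by a fence counter that the caller advances by hand. The idea is
that a table read by an in-flight draw is never written; a change to one
slot copies the old state into a free table and patches only that slot.

```c
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS    16 /* sampler slots, as discussed in the thread */
#define DESC_DWORDS   8 /* assumed size of one image resource descriptor */
#define NUM_VERSIONS  4 /* hypothetical ring depth */

/* One complete copy of the descriptor array. */
struct desc_table {
    uint32_t slot[NUM_SLOTS][DESC_DWORDS];
    uint64_t last_use_fence; /* fence of the last draw reading this copy; 0 = none */
};

/* A zero-initialized ring is a valid empty ring. */
struct desc_ring {
    struct desc_table version[NUM_VERSIONS];
    unsigned current;         /* index of the table the next draw will use */
    uint64_t next_fence;      /* last fence value handed out to a draw */
    uint64_t completed_fence; /* highest fence the GPU has finished (simulated) */
};

/* Change one slot. Returns the table that now holds the new state, or NULL
 * if every other table is still in flight and the caller would have to wait. */
static struct desc_table *desc_ring_update(struct desc_ring *r, unsigned slot,
                                           const uint32_t *desc)
{
    struct desc_table *cur = &r->version[r->current];

    /* No in-flight draw reads the current table: patch it in place. */
    if (cur->last_use_fence <= r->completed_fence) {
        memcpy(cur->slot[slot], desc, sizeof(cur->slot[slot]));
        return cur;
    }

    /* Otherwise take the next table, provided the GPU is done with it. */
    unsigned next = (r->current + 1) % NUM_VERSIONS;
    struct desc_table *dst = &r->version[next];
    if (dst->last_use_fence > r->completed_fence)
        return NULL;

    /* Copy over the old state, then set just the slot that changed. */
    memcpy(dst->slot, cur->slot, sizeof(dst->slot));
    memcpy(dst->slot[slot], desc, sizeof(dst->slot[slot]));
    dst->last_use_fence = 0;
    r->current = next;
    return dst;
}

/* Record that a draw reads the current table; returns the draw's fence. */
static uint64_t desc_ring_draw(struct desc_ring *r)
{
    uint64_t fence = ++r->next_fence;
    r->version[r->current].last_use_fence = fence;
    return fence;
}
```

A caller would invoke desc_ring_update() whenever a binding changes and
desc_ring_draw() once per draw. The NULL return is the point where a real
driver would have to either allocate another buffer or wait for the shaders
to finish, which is exactly the trade-off described above.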

Or am I missing something here?

Christian.