[Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

Thu Aug 8 12:38:28 PDT 2013

On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák <maraeo at gmail.com> wrote:
> On Thu, Aug 8, 2013 at 6:57 PM, Christian König <deathsimple at vodafone.de> wrote:
>> On 08.08.2013 16:33, Marek Olšák wrote:
>>>
>>> On Thu, Aug 8, 2013 at 3:09 PM, Christian König <deathsimple at vodafone.de>
>>> wrote:
>>>>
>>>> On 08.08.2013 14:38, Marek Olšák wrote:
>>>>
>>>>> On Thu, Aug 8, 2013 at 9:47 AM, Christian König
>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>
>>>>>> On 08.08.2013 02:20, Marek Olšák wrote:
>>>>>>
>>>>>>> FMASK is bound as a separate texture. For every texture, there can be
>>>>>>> an FMASK. Therefore a separate array of resource slots has to be
>>>>>>> added.
>>>>>>>
>>>>>>> This adds a new mechanism for emitting resource descriptors. Its
>>>>>>> features are:
>>>>>>> - resource descriptors are stored in an ordinary buffer (not in a CS)
>>>>>>
>>>>>>
>>>>>> Having resource descriptors outside of the CS has two problems that we
>>>>>> need to solve first:
>>>>>>
>>>>>> 1. Fine-grained descriptor updates don't work; I already tried that. The
>>>>>> problem is that, unlike on previous ASICs, descriptors are now a memory
>>>>>> block and no longer part of the CP context. So when we (for example) have
>>>>>> a draw command executing and the next draw command is using new resources
>>>>>> for a specific slot, we would either block until the first draw command
>>>>>> is finished (which is bad for performance) or change the descriptors
>>>>>> while they are still in use (which results in VM faults).
>>>>>
>>>>> So what would the proper solution be here? Do I need to flush some
>>>>> caches or would moving the descriptor updates to the constant IB fix
>>>>> that?
>>>>
>>>>
>>>> Actually the current implementation worked better than anything else I
>>>> tried.
>>>>
>>>> If you really need the resource descriptors in a separate buffer, you need
>>>> to use one buffer for each draw call and always write the full buffer
>>>> contents (no partial updates). Flushing anything won't really help either.
>>>>
>>>> The only solution I see using one buffer is to block until the last draw
>>>> call is finished with WAIT_REG_MEM, but that would be quite disastrous for
>>>> performance.
>>>>
>>>>
>>>>>> 2. If my understanding is correct, when they are embedded the descriptors
>>>>>> are preloaded into the caches while executing the IB. So to achieve the
>>>>>> same speed with descriptors outside of the IB, you need to add additional
>>>>>> commands to the constant IB, which is new to SI and which we currently
>>>>>> don't support in the CS interface.
>>>>>
>>>>> There seems to be support for the constant IB. The CS ioctl chunk ID
>>>>> is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
>>>>> si_vm_packet3_ce_check. Is there anything missing?
>>>>
>>>>
>>>> The userspace side seems to be missing, and except for throwing NOP
>>>> packets into it we never tested it. I know from the closed source side
>>>> that it actually was quite tricky for them to get working.
>>>>
>>>> In addition, please note that I'm not 100% sure that just putting the
>>>> descriptors into the IB is really helping here. It was just the simplest
>>>> solution to avoid allocating a new buffer on each draw call.
>>>
>>> I understand. I don't really need to have resource descriptors in a
>>> separate buffer, all I need is these 3 basic features a gallium driver
>>> should support:
>>> - fine-grained resource updates (mainly for performance, see below)
>>> - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
>>> - no GPU crash if a shader is using SAMPLER[15] but there are no samplers
>>> bound
>>>
>>> FYI, partial sampler view and sampler state updates are coming to
>>> gallium, Brian Paul already has some patches, it's just a matter of
>>> time now. Vertex and constant buffer states already support partial
>>> updates.
>>
>>
>> That shouldn't be too much of a problem.
>>
>> Just allocate a state at startup and initialize it with the proper PM4
>> commands for 16 samplers, then update the resource descriptors in that state
>> when we change the bound textures/samplers/views/constants/whatever. All we
>> need to do then is set the emitted state to NULL so that it gets
>> re-emitted in the next draw command.
>
> That would re-emit all 16 shader resources even if just one of them
> needs to be changed. I was trying to avoid this inefficiency. Is it
> really impossible to emit just one resource descriptor and keep the
> others unchanged? This is a basic D3D10/11 feature, for example:
>
> void ID3D11DeviceContext::VSSetShaderResources(
>   [in]  UINT StartSlot,
>   [in]  UINT NumViews,
>   [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
> );
>
> If the constant engine is required to implement this interface
> efficiently, then I'd like to work on constant IB support.

You'll need to either store them in memory or re-emit them if you
store them in the IB.  The CE is mainly there so that it can prime the
TC in parallel with the command stream processing.

Alex
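
The constraint discussed in this thread (descriptors live in memory, so a
table that an in-flight draw may still read must never be modified) can be
sketched in C. This is only an illustrative sketch, not the radeonsi
implementation: all names (desc_state, desc_set_slot, desc_prepare_draw,
RING_TABLES) are hypothetical, the ring array merely stands in for
GPU-visible buffers, and the 8-dword descriptor size and the all-zero
"unbound" descriptor are assumptions. Slot updates touch only a CPU-side
shadow copy; at draw time, if anything changed, the full table is written
to a fresh ring entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS    16   /* resource slots per shader stage */
#define DESC_DWORDS  8    /* assumed size of one image descriptor */
#define TABLE_DWORDS (NUM_SLOTS * DESC_DWORDS)
#define RING_TABLES  4    /* hypothetical number of tables kept in flight */

struct desc_state {
    uint32_t shadow[TABLE_DWORDS];            /* CPU copy, always up to date */
    uint32_t ring[RING_TABLES][TABLE_DWORDS]; /* stand-in for GPU buffers */
    unsigned next;                            /* next ring entry to fill */
    int      current;                         /* entry used by the last draw, -1 if none */
    int      dirty;                           /* shadow changed since last upload */
};

static void desc_init(struct desc_state *s)
{
    memset(s, 0, sizeof(*s));
    s->current = -1;
}

/* Fine-grained update: only the CPU shadow is touched, never a table
 * that a draw may still be reading. desc == NULL unbinds the slot by
 * zeroing it (an assumption of this sketch). */
static void desc_set_slot(struct desc_state *s, unsigned slot,
                          const uint32_t *desc)
{
    uint32_t *dst;

    assert(slot < NUM_SLOTS);
    dst = &s->shadow[slot * DESC_DWORDS];
    if (desc)
        memcpy(dst, desc, DESC_DWORDS * sizeof(uint32_t));
    else
        memset(dst, 0, DESC_DWORDS * sizeof(uint32_t));
    s->dirty = 1;
}

/* Called before each draw. If anything changed, write the FULL table
 * into a fresh ring entry (no partial updates) and return its index.
 * A real driver would also have to wait on a fence before reusing an
 * entry after the ring wraps; that is not modelled here. */
static int desc_prepare_draw(struct desc_state *s)
{
    if (s->dirty || s->current < 0) {
        memcpy(s->ring[s->next], s->shadow, sizeof(s->shadow));
        s->current = (int)s->next;
        s->next = (s->next + 1) % RING_TABLES;
        s->dirty = 0;
    }
    return s->current;
}
```

With this layout, a draw that follows no state change reuses the previous
table, while a single-slot change costs one full-table copy instead of a
stall or an in-place write to memory that is still in use.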