[Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

Fri Aug 9 06:29:51 PDT 2013

On Fri, Aug 9, 2013 at 10:34 AM, Christian König
<deathsimple at vodafone.de> wrote:
> Am 08.08.2013 21:38, schrieb Alex Deucher:
>
>> On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>
>>> On Thu, Aug 8, 2013 at 6:57 PM, Christian König <deathsimple at vodafone.de>
>>> wrote:
>>>>
>>>> Am 08.08.2013 16:33, schrieb Marek Olšák:
>>>>>
>>>>> On Thu, Aug 8, 2013 at 3:09 PM, Christian König
>>>>> <deathsimple at vodafone.de>
>>>>> wrote:
>>>>>>
>>>>>> Am 08.08.2013 14:38, schrieb Marek Olšák:
>>>>>>
>>>>>>> On Thu, Aug 8, 2013 at 9:47 AM, Christian König
>>>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>>>
>>>>>>>> Am 08.08.2013 02:20, schrieb Marek Olšák:
>>>>>>>>
>>>>>>>>> FMASK is bound as a separate texture. For every texture, there can
>>>>>>>>> be an FMASK. Therefore a separate array of resource slots has to be
>>>>>>>>> added.
>>>>>>>>>
>>>>>>>>> This adds a new mechanism for emitting resource descriptors. Its
>>>>>>>>> features are:
>>>>>>>>> - resource descriptors are stored in an ordinary buffer (not in a
>>>>>>>>>   CS)
>>>>>>>>
>>>>>>>>
>>>>>>>> Having resource descriptors outside of the CS has two problems that
>>>>>>>> we need to solve first:
>>>>>>>>
>>>>>>>> 1. Fine-grained descriptor updates don't work; I already tried
>>>>>>>> that. The problem is that, unlike on previous ASICs, descriptors
>>>>>>>> are now a memory block and so no longer part of the CP context.
>>>>>>>> So when we (for example) have a draw command executing and the
>>>>>>>> next draw command uses new resources for a specific slot, we
>>>>>>>> would either block until the first draw command is finished
>>>>>>>> (which is bad for performance) or change the descriptors while
>>>>>>>> they are still in use (which results in VM faults).
>>>>>>>
>>>>>>> So what would the proper solution be here? Do I need to flush some
>>>>>>> caches or would moving the descriptor updates to the constant IB fix
>>>>>>> that?
>>>>>>
>>>>>>
>>>>>> Actually the current implementation worked better than anything else I
>>>>>> tried.
>>>>>>
>>>>>> When you really need the resource descriptors in a separate buffer,
>>>>>> you need to use one buffer for each draw call and always write the
>>>>>> full buffer contents (no partial updates). Flushing anything won't
>>>>>> really help either.
>>>>>>
>>>>>> The only solution I see using one buffer is to block until the last
>>>>>> draw call is finished with WAIT_REG_MEM, but that would be quite
>>>>>> disastrous for performance.
>>>>>>
>>>>>>
>>>>>>>> 2. If my understanding is correct, when they are embedded the
>>>>>>>> descriptors are preloaded into the caches while executing the IB.
>>>>>>>> So to achieve the same speed with descriptors outside of the IB,
>>>>>>>> you need to add additional commands to the constant IB, which is
>>>>>>>> new to SI and which we currently don't support in the CS
>>>>>>>> interface.
>>>>>>>
>>>>>>> There seems to be support for the constant IB. The CS ioctl chunk ID
>>>>>>> is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
>>>>>>> si_vm_packet3_ce_check. Is there anything missing?
>>>>>>
>>>>>>
>>>>>> The userspace side seems to be missing, and except for throwing NOP
>>>>>> packets into it we never tested it. I know from the closed source
>>>>>> side that it actually was quite tricky for them to get working.
>>>>>>
>>>>>> In addition to that, please note that I'm not 100% sure that just
>>>>>> putting the descriptors into the IB is really helping here. It was
>>>>>> just the simplest solution to avoid allocating a new buffer on each
>>>>>> draw call.
>>>>>
>>>>> I understand. I don't really need to have resource descriptors in a
>>>>> separate buffer; all I need are these 3 basic features a gallium
>>>>> driver should support:
>>>>> - fine-grained resource updates (mainly for performance, see below)
>>>>> - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
>>>>> - no GPU crash if a shader is using SAMPLER[15] but there are no
>>>>>   samplers bound
>>>>>
>>>>> FYI, partial sampler view and sampler state updates are coming to
>>>>> gallium, Brian Paul already has some patches, it's just a matter of
>>>>> time now. Vertex and constant buffer states already support partial
>>>>> updates.
>>>>
>>>>
>>>> That shouldn't be too much of a problem.
>>>>
>>>> Just allocate a state at startup and initialize it with the proper PM4
>>>> commands for 16 samplers, then update the resource descriptors in that
>>>> state when we change the bound textures/samplers/views/constants/whatever.
>>>> All we need to do then is set the emitted state to NULL so that it gets
>>>> re-emitted in the next draw command.
>>>
>>> That would re-emit all 16 shader resources even if just one of them
>>> needs to be changed. I was trying to avoid this inefficiency. Is it
>>> really impossible to emit just one resource descriptor and keep the
>>> others unchanged? This is a basic D3D10/11 feature, for example:
>>>
>>> void ID3D11DeviceContext::VSSetShaderResources(
>>>    [in]  UINT StartSlot,
>>>    [in]  UINT NumViews,
>>>    [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
>>> );
>>>
>>> If the constant engine is required to implement this interface
>>> efficiently, then I'd like to work on constant IB support.
>>
>> You'll need to either store them in memory or re-emit them if you
>> store them in the IB.  The CE is mainly there so that it can prime the
>> TC in parallel with the command stream processing.
>
>
> Yeah indeed. The CE is just for prefetching everything into caches and
> doesn't really help here.
>
> The only two options I see are either fully emitting it into the command
> stream whenever anything changes, or allocating a new buffer for the
> resources on each new draw call, copying over the old state and then setting
> just the things that changed. Both options have their pros and cons; no idea
> which might be better.
>
> Fact is the resource descriptors are not allowed to change as long as the
> shaders are running.

I think flushing the TC before changing the descriptors should help.
If not, then PS_PARTIAL_FLUSH or some other equivalent of WAIT_UNTIL
should.

Marek
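
For illustration, the "new buffer per draw call, copy the old state, set only
what changed" option discussed above could look roughly like the sketch below.
It is plain CPU-side C, not radeonsi code: the names desc_buffer, desc_state,
set_descriptors and prepare_draw are hypothetical, calloc() stands in for
allocating a GPU buffer, and the 8-dword slot size is assumed for an image
resource descriptor. The point is only that a buffer a previous draw may still
be reading is never modified.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_SLOTS   16  /* resource slots per shader stage (as in the thread) */
#define DESC_DWORDS 8   /* assumed size of one image resource descriptor */

/* Hypothetical stand-in for a GPU buffer holding a full descriptor array. */
struct desc_buffer {
    uint32_t desc[NUM_SLOTS][DESC_DWORDS];
};

struct desc_state {
    uint32_t pending[NUM_SLOTS][DESC_DWORDS]; /* latest bindings, CPU copy */
    uint32_t dirty_mask;                      /* slots changed since last draw */
    struct desc_buffer *current;              /* buffer used by the last draw */
};

/* Bind `count` descriptors starting at slot `start` (StartSlot/NumViews
 * style). `descs` is a flat array of count * DESC_DWORDS dwords; passing
 * NULL unbinds the range by writing all-zero descriptors. */
static void set_descriptors(struct desc_state *s, unsigned start,
                            unsigned count, const uint32_t *descs)
{
    assert(start + count <= NUM_SLOTS);
    for (unsigned i = 0; i < count; i++) {
        uint32_t *dst = s->pending[start + i];
        if (descs)
            memcpy(dst, descs + i * DESC_DWORDS, sizeof(s->pending[0]));
        else
            memset(dst, 0, sizeof(s->pending[0]));
        s->dirty_mask |= 1u << (start + i);
    }
}

/* Called before each draw; returns the buffer that draw should reference.
 * If nothing changed, the previous buffer is reused. Otherwise a fresh
 * buffer is allocated, the old contents are copied over and only the dirty
 * slots are patched. The old buffer is left untouched; a real driver would
 * release it once the GPU is done with it (omitted here, so it leaks). */
static struct desc_buffer *prepare_draw(struct desc_state *s)
{
    if (s->current && !s->dirty_mask)
        return s->current;

    struct desc_buffer *buf = calloc(1, sizeof(*buf));
    if (!buf)
        return NULL;
    if (s->current)
        memcpy(buf, s->current, sizeof(*buf));

    for (unsigned i = 0; i < NUM_SLOTS; i++) {
        if (s->dirty_mask & (1u << i))
            memcpy(buf->desc[i], s->pending[i], sizeof(buf->desc[i]));
    }
    s->dirty_mask = 0;
    s->current = buf;
    return buf;
}
```

With this shape, changing a single slot costs one buffer allocation and one
full copy of the array per draw that follows a change, while draws with no
state change reuse the previous buffer. Whether that beats re-emitting
everything into the command stream is exactly the open question in the thread.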