[Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

Marek Olšák maraeo at gmail.com
Fri Aug 9 11:06:16 PDT 2013


On Fri, Aug 9, 2013 at 3:48 PM, Christian König <deathsimple at vodafone.de> wrote:
> Am 09.08.2013 15:29, schrieb Marek Olšák:
>>
>> On Fri, Aug 9, 2013 at 10:34 AM, Christian König
>> <deathsimple at vodafone.de> wrote:
>>>
>>> Am 08.08.2013 21:38, schrieb Alex Deucher:
>>>
>>>> On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>>
>>>>> On Thu, Aug 8, 2013 at 6:57 PM, Christian König
>>>>> <deathsimple at vodafone.de>
>>>>> wrote:
>>>>>>
>>>>>> Am 08.08.2013 16:33, schrieb Marek Olšák:
>>>>>>>
>>>>>>> On Thu, Aug 8, 2013 at 3:09 PM, Christian König
>>>>>>> <deathsimple at vodafone.de>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Am 08.08.2013 14:38, schrieb Marek Olšák:
>>>>>>>>
>>>>>>>>> On Thu, Aug 8, 2013 at 9:47 AM, Christian König
>>>>>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>>>>>
>>>>>>>>>> Am 08.08.2013 02:20, schrieb Marek Olšák:
>>>>>>>>>>
>>>>>>>>>>> FMASK is bound as a separate texture. For every texture, there can be
>>>>>>>>>>> an FMASK. Therefore a separate array of resource slots has to be added.
>>>>>>>>>>>
>>>>>>>>>>> This adds a new mechanism for emitting resource descriptors, whose
>>>>>>>>>>> features are:
>>>>>>>>>>> - resource descriptors are stored in an ordinary buffer (not in a CS)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Having resource descriptors outside of the CS has two problems that we
>>>>>>>>>> need to solve first:
>>>>>>>>>>
>>>>>>>>>> 1. Fine-grained descriptor updates don't work, I already tried that. The
>>>>>>>>>> problem is that unlike on previous ASICs, descriptors are now a memory
>>>>>>>>>> block, so no longer part of the CP context. So when we (for example) have
>>>>>>>>>> a draw command executing and the next draw command is using new resources
>>>>>>>>>> for a specific slot, we would either block until the first draw command is
>>>>>>>>>> finished (which is bad for performance) or change the descriptors while
>>>>>>>>>> they are still in use (which results in VM faults).
>>>>>>>>>
>>>>>>>>> So what would the proper solution be here? Do I need to flush some
>>>>>>>>> caches or would moving the descriptor updates to the constant IB fix that?
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually, the current implementation worked better than anything else I
>>>>>>>> tried.
>>>>>>>>
>>>>>>>> When you really need the resource descriptors in a separate buffer, you
>>>>>>>> need to use one buffer for each draw call and always write the full buffer
>>>>>>>> contents (no partial updates). Flushing anything won't really help either.
>>>>>>>>
>>>>>>>> The only solution I see using one buffer is to block until the last draw
>>>>>>>> call is finished with WAIT_REG_MEM, but that would be quite disastrous for
>>>>>>>> performance.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> 2. If my understanding is correct, when they are embedded the descriptors
>>>>>>>>>> are preloaded into the caches while executing the IB, so to achieve the
>>>>>>>>>> same speed with descriptors outside of the IB you need to add additional
>>>>>>>>>> commands to the constant IB, which is new to SI and which we currently
>>>>>>>>>> don't support in the CS interface.
>>>>>>>>>
>>>>>>>>> There seems to be support for the constant IB. The CS ioctl chunk ID is
>>>>>>>>> RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
>>>>>>>>> si_vm_packet3_ce_check. Is there anything missing?
>>>>>>>>
>>>>>>>>
>>>>>>>> The userspace side seems to be missing, and except for throwing NOP packets
>>>>>>>> into it we never tested it. I know from the closed-source side that it
>>>>>>>> actually was quite tricky for them to get working.
>>>>>>>>
>>>>>>>> In addition to that, please note that I'm not 100% sure that just putting
>>>>>>>> the descriptors into the IB is really helping here. It was just the simplest
>>>>>>>> solution to avoid allocating a new buffer on each draw call.
>>>>>>>
>>>>>>> I understand. I don't really need to have resource descriptors in a
>>>>>>> separate buffer; all I need are these 3 basic features a gallium driver
>>>>>>> should support:
>>>>>>> - fine-grained resource updates (mainly for performance, see below)
>>>>>>> - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
>>>>>>> - no GPU crash if a shader is using SAMPLER[15] but there are no samplers
>>>>>>> bound
>>>>>>>
>>>>>>> FYI, partial sampler view and sampler state updates are coming to gallium;
>>>>>>> Brian Paul already has some patches, so it's just a matter of time now.
>>>>>>> Vertex and constant buffer states already support partial updates.
>>>>>>
>>>>>>
>>>>>> That shouldn't be too much of a problem.
>>>>>>
>>>>>> Just allocate a state at startup and initialize it with the proper pm4
>>>>>> commands for 16 samplers, then update the resource descriptors in that state
>>>>>> when we change the bound textures/samplers/views/constants/whatever. All we
>>>>>> need to do then is set the emitted state to NULL so that it gets re-emitted
>>>>>> in the next draw command.
>>>>>
>>>>> That would re-emit all 16 shader resources even if just one of them
>>>>> needs to be changed. I was trying to avoid this inefficiency. Is it
>>>>> really impossible to emit just one resource descriptor and keep the
>>>>> others unchanged? This is a basic D3D10/11 feature, for example:
>>>>>
>>>>> void ID3D11DeviceContext::VSSetShaderResources(
>>>>>     [in]  UINT StartSlot,
>>>>>     [in]  UINT NumViews,
>>>>>     [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
>>>>> );
>>>>>
>>>>> If the constant engine is required to implement this interface
>>>>> efficiently, then I'd like to work on constant IB support.
>>>>
>>>> You'll need to either store them in memory or re-emit them if you
>>>> store them in the IB.  The CE is mainly there so that it can prime the
>>>> TC in parallel with the command stream processing.
>>>
>>>
>>> Yeah, indeed. The CE is just for prefetching everything into caches and
>>> doesn't really help here.
>>>
>>> The only two options I see are either fully emitting the state into the
>>> command stream whenever anything changes, or allocating a new buffer for the
>>> resources on each new draw call, copying over the old state and then setting
>>> just the things that changed. Both options have their pros and cons; no idea
>>> which might be better.
>>>
>>> Fact is, the resource descriptors are not allowed to change as long as the
>>> shaders are running.
>>
>> I think flushing the TC before changing the descriptors should help.
>> If not, then PS_PARTIAL_FLUSH or some other equivalent of WAIT_UNTIL
>> should.
>
>
> Huh? Why?
>
> The scalar engine of the shaders loads the resource descriptors from memory,
> so as long as the shaders are running we can't modify the descriptors. So
> the only option here is to either have multiple versions of the descriptors
> around at the same time, or wait for the shaders to finish before modifying
> the descriptors, which means draining the shader pipeline quite a bit.
>
> Or am I missing something here?

What if I kept the current emission code, and only allocated a new
buffer at the end of the emit function, copied all descriptors to it
using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it? The
buffer where the descriptors are updated using WRITE_DATA would act as
a staging buffer only, and shaders would always read from the fresh new
copy. Does that sound good to you?
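
Roughly what I have in mind, as untested pseudo-C (the struct and the
helpers below are made-up placeholders for illustration, not the actual
radeonsi/winsys functions):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for buffer allocation and CP packet
 * emission (CP_DMA copy, SET_SH_REG for the user data pointer). */
uint64_t alloc_gpu_buffer(unsigned size);
void emit_cp_dma_copy(uint64_t dst_va, uint64_t src_va, unsigned size);
void emit_set_user_data(unsigned user_sgpr, uint64_t descriptor_va);

struct desc_buffer {
   uint64_t staging_va;  /* buffer updated in place with WRITE_DATA */
   uint64_t current_va;  /* fresh copy the shaders actually read from */
   unsigned size;        /* num_slots * 8 dwords * 4 bytes */
   bool dirty;
};

static void emit_descriptors(struct desc_buffer *buf, unsigned user_sgpr)
{
   if (!buf->dirty)
      return;

   /* Allocate a new buffer for this draw so shaders that are still
    * running keep reading the previous copy, then copy the staging
    * contents over with CP_DMA. */
   buf->current_va = alloc_gpu_buffer(buf->size);
   emit_cp_dma_copy(buf->current_va, buf->staging_va, buf->size);

   /* Point SPI_SHADER_USER_DATA_* at the fresh copy. */
   emit_set_user_data(user_sgpr, buf->current_va);

   buf->dirty = false;
}

Partial updates would still go through WRITE_DATA into the staging
buffer, so only the changed slots are written there; the per-draw cost
would be one CP_DMA copy plus a small buffer allocation.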

Marek

