[Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

Mon Jul 24 17:40:42 UTC 2017

Some additional notes:

1) si_release_bindless_descriptors can be inlined.

2) Sampler slots have size = 16*4 and image slots have size = 8*4 in
your patch. The addressing is also done with a multiple of the size,
which allows creating a sampler and an image in the same slot such
that one of them overwrites the other.

create_bindless_sampler(): slot = 1, address = 1*16 = 16
create_bindless_image(): slot = 2, address = 2*8 = 16

The rest looks OK.

Marek

On Wed, Jul 19, 2017 at 9:14 AM, Samuel Pitoiset
<samuel.pitoiset at gmail.com> wrote:
>
>
> On 07/19/2017 12:01 AM, Marek Olšák wrote:
>>
>> On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>> wrote:
>>>
>>> Hi Samuel,
>>>
>>> On 07.07.2017 03:45, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:
>>>>>
>>>>>
>>>>> On 04.07.2017 15:05, Samuel Pitoiset wrote:
>>>>>>
>>>>>>
>>>>>> Using VRAM address as bindless handles is not a good idea because
>>>>>> we have to use LLVMIntToPTr and the LLVM CSE pass can't optimize
>>>>>> because it has no information about the pointer.
>>>>>>
>>>>>> Instead, use slots indexes like the existing descriptors.
>>>>>>
>>>>>> This improves performance with DOW3 by +7%.
>>>>>
>>>>>
>>>>>
>>>>> Wow.
>>>>>
>>>>> The thing is, burning a pair of user SGPRs for this seems a bit
>>>>> overkill,
>>>>> especially since it also hurts apps that don't use bindless at all.
>>>>>
>>>>> Do you have some examples of how LLVM fails here? Could we perhaps
>>>>> avoid
>>>>> most of the performance issues by casting 0 to an appropriate pointer
>>>>> type
>>>>> once, and then using the bindless handle as an index relative to that
>>>>> pointer?
>>>>
>>>>
>>>>
>>>> Here's two shaders, 1) is with master, 2) is with this series:
>>>>
>>>> 1) https://hastebin.com/uvamarelig
>>>> 2) https://hastebin.com/voguqihilu
>>>>
>>>> The first shader contains a bunch of s_buffer_load_dword that the second
>>>> one doesn't need because CSE do its job there. This is because of
>>>> IntToPtr
>>>> but if we use noalias, the pass is able to eliminate the redundant
>>>> descriptor load operations.
>>>
>>>
>>>
>>> So I looked into your example again in more detail, compiling both
>>> shaders
>>> with
>>>
>>>    llc -march=amdgcn -mcpu=tonga
>>>
>>> and also just extracting the assembly, and I think your analysis is
>>> actually
>>> flawed. It's true that the shader in (2) has fewer buffer loads, but the
>>> only buffer loads that are actually removed rather than just shuffled
>>> around
>>> are ones that load .y components. So basically, the reason you get fewer
>>> buffer loads is that you're effectly using 32 bit pointers. It's a bit
>>> shocking that that gives you a 7% performance improvement...
>>>
>>> If there are really examples where it makes a difference for CSE, then
>>> perhaps the same result could be achieved with alias/noalias metadata on
>>> the
>>> load/store instructions.
>>>
>>> In the meantime, perhaps a good comparison would be to use the original,
>>> inttoptr-based code, but only load the lower 32 bits for a handle and use
>>> bit shifts to get a full 64 bit pointers. That way, at least it should be
>>> possible to more easily find shaders where there is a genuine difference.
>>
>>
>> He probably gave you wrong shaders. If you imagine an NxN filter doing
>> NxN texture fetches, you should get NxN loads for the same descriptor
>> with inttoptr. Also, LICM won't move the descriptor loads out of
>> loops. Those are the biggest issues. If you can't do CSE and LICM, it
>> can easily make the 7% difference.
>
>
> There are many DOW3 shaders that are different with this series. The one I
> pasted you is maybe not the most interesting one though. I can try to find
> one with descriptor loads inside a loop.
>
>
>>
>> Marek
>>
>