[Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

Wed Jul 19 07:14:08 UTC 2017

On 07/19/2017 12:01 AM, Marek Olšák wrote:
> On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> Hi Samuel,
>>
>> On 07.07.2017 03:45, Samuel Pitoiset wrote:
>>>
>>> On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:
>>>>
>>>> On 04.07.2017 15:05, Samuel Pitoiset wrote:
>>>>>
>>>>> Using VRAM address as bindless handles is not a good idea because
>>>>> we have to use LLVMIntToPTr and the LLVM CSE pass can't optimize
>>>>> because it has no information about the pointer.
>>>>>
>>>>> Instead, use slots indexes like the existing descriptors.
>>>>>
>>>>> This improves performance with DOW3 by +7%.
>>>>
>>>>
>>>> Wow.
>>>>
>>>> The thing is, burning a pair of user SGPRs for this seems a bit overkill,
>>>> especially since it also hurts apps that don't use bindless at all.
>>>>
>>>> Do you have some examples of how LLVM fails here? Could we perhaps avoid
>>>> most of the performance issues by casting 0 to an appropriate pointer type
>>>> once, and then using the bindless handle as an index relative to that
>>>> pointer?
>>>
>>>
>>> Here's two shaders, 1) is with master, 2) is with this series:
>>>
>>> 1) https://hastebin.com/uvamarelig
>>> 2) https://hastebin.com/voguqihilu
>>>
>>> The first shader contains a bunch of s_buffer_load_dword that the second
>>> one doesn't need because CSE do its job there. This is because of IntToPtr
>>> but if we use noalias, the pass is able to eliminate the redundant
>>> descriptor load operations.
>>
>>
>> So I looked into your example again in more detail, compiling both shaders
>> with
>>
>>    llc -march=amdgcn -mcpu=tonga
>>
>> and also just extracting the assembly, and I think your analysis is actually
>> flawed. It's true that the shader in (2) has fewer buffer loads, but the
>> only buffer loads that are actually removed rather than just shuffled around
>> are ones that load .y components. So basically, the reason you get fewer
>> buffer loads is that you're effectly using 32 bit pointers. It's a bit
>> shocking that that gives you a 7% performance improvement...
>>
>> If there are really examples where it makes a difference for CSE, then
>> perhaps the same result could be achieved with alias/noalias metadata on the
>> load/store instructions.
>>
>> In the meantime, perhaps a good comparison would be to use the original,
>> inttoptr-based code, but only load the lower 32 bits for a handle and use
>> bit shifts to get a full 64 bit pointers. That way, at least it should be
>> possible to more easily find shaders where there is a genuine difference.
> 
> He probably gave you wrong shaders. If you imagine an NxN filter doing
> NxN texture fetches, you should get NxN loads for the same descriptor
> with inttoptr. Also, LICM won't move the descriptor loads out of
> loops. Those are the biggest issues. If you can't do CSE and LICM, it
> can easily make the 7% difference.

There are many DOW3 shaders that are different with this series. The one 
I pasted you is maybe not the most interesting one though. I can try to 
find one with descriptor loads inside a loop.

> 
> Marek
>