[Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

Marek Olšák maraeo at gmail.com
Tue Jul 18 22:01:49 UTC 2017


On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> Hi Samuel,
>
> On 07.07.2017 03:45, Samuel Pitoiset wrote:
>>
>> On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:
>>>
>>> On 04.07.2017 15:05, Samuel Pitoiset wrote:
>>>>
>>>> Using VRAM addresses as bindless handles is not a good idea because
>>>> we have to use LLVMBuildIntToPtr and the LLVM CSE pass can't optimize
>>>> because it has no information about the pointer.
>>>>
>>>> Instead, use slot indexes like the existing descriptors.
>>>>
>>>> This improves performance with DOW3 by +7%.
>>>
>>>
>>> Wow.
>>>
>>> The thing is, burning a pair of user SGPRs for this seems a bit overkill,
>>> especially since it also hurts apps that don't use bindless at all.
>>>
>>> Do you have some examples of how LLVM fails here? Could we perhaps avoid
>>> most of the performance issues by casting 0 to an appropriate pointer type
>>> once, and then using the bindless handle as an index relative to that
>>> pointer?
>>
>>
>> Here are two shaders: 1) is with master, 2) is with this series:
>>
>> 1) https://hastebin.com/uvamarelig
>> 2) https://hastebin.com/voguqihilu
>>
>> The first shader contains a bunch of s_buffer_load_dword that the second
>> one doesn't need because CSE does its job there. This is because of IntToPtr,
>> but if we use noalias, the pass is able to eliminate the redundant
>> descriptor load operations.
>
>
> So I looked into your example again in more detail, compiling both shaders
> with
>
>   llc -march=amdgcn -mcpu=tonga
>
> and also just extracting the assembly, and I think your analysis is actually
> flawed. It's true that the shader in (2) has fewer buffer loads, but the
> only buffer loads that are actually removed rather than just shuffled around
> are ones that load .y components. So basically, the reason you get fewer
> buffer loads is that you're effectively using 32-bit pointers. It's a bit
> shocking that that gives you a 7% performance improvement...
>
> If there are really examples where it makes a difference for CSE, then
> perhaps the same result could be achieved with alias/noalias metadata on the
> load/store instructions.
>
> In the meantime, perhaps a good comparison would be to use the original,
> inttoptr-based code, but only load the lower 32 bits for a handle and use
> bit shifts to get a full 64-bit pointer. That way, at least it should be
> possible to more easily find shaders where there is a genuine difference.
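
One possible reading of that suggestion, as a minimal sketch only: keep a
32-bit handle and widen it to a 64-bit address with a shift. The 16-byte
alignment, the shift amount, and the names pack_handle/unpack_handle are
assumptions made for illustration; they are not taken from the radeonsi
code or from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: descriptors are 16-byte aligned, so the low 4 address bits
 * are always zero and can be dropped from the handle. */
#define HANDLE_SHIFT 4

/* Hypothetical helper: squeeze a 64-bit virtual address into 32 bits. */
static uint32_t pack_handle(uint64_t va)
{
    assert((va & ((1ull << HANDLE_SHIFT) - 1)) == 0); /* alignment holds */
    assert((va >> HANDLE_SHIFT) <= UINT32_MAX);       /* fits in 32 bits */
    return (uint32_t)(va >> HANDLE_SHIFT);
}

/* Hypothetical helper: rebuild the full 64-bit address from the handle.
 * The shader would then feed this value to inttoptr as before. */
static uint64_t unpack_handle(uint32_t handle)
{
    return (uint64_t)handle << HANDLE_SHIFT;
}
```

With this scheme the shader only has to load one dword per handle, which
isolates the "fewer loads because of narrower handles" effect from any
CSE effect.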

He probably gave you the wrong shaders. If you imagine an NxN filter doing
NxN texture fetches, you should get NxN loads of the same descriptor
with inttoptr, because CSE can't merge them. Also, LICM won't move the
descriptor loads out of loops. Those are the biggest issues. If you can't
do CSE and LICM, that can easily account for the 7% difference.
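
To make the NxN point concrete, here is a toy C model (not shader code
and not the radeonsi descriptor layout) that counts descriptor loads for
an NxN filter in both situations. struct descriptor, load_descriptor and
the two filter functions are hypothetical names used only for this
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a texture descriptor. */
struct descriptor { uint32_t words[8]; };

/* Counts how many times a descriptor is fetched from the table. */
static unsigned descriptor_loads;

static const struct descriptor *
load_descriptor(const struct descriptor *table, uint32_t slot)
{
    descriptor_loads++;
    return &table[slot];
}

/* Toy "texture fetch": reads one word of the descriptor. */
static uint32_t fetch(const struct descriptor *desc, int x, int y)
{
    return desc->words[(x + y) % 8];
}

/* Models the case without CSE/LICM: the descriptor is reloaded for
 * every fetch, giving n*n loads. */
uint32_t filter_reload_each_fetch(const struct descriptor *table,
                                  uint32_t slot, int n)
{
    uint32_t sum = 0;
    assert(n >= 0);
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sum += fetch(load_descriptor(table, slot), x, y);
    return sum;
}

/* Models the result after CSE and LICM: one load, hoisted out of
 * both loops. */
uint32_t filter_hoisted(const struct descriptor *table,
                        uint32_t slot, int n)
{
    const struct descriptor *desc = load_descriptor(table, slot);
    uint32_t sum = 0;
    assert(n >= 0);
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sum += fetch(desc, x, y);
    return sum;
}
```

For a 3x3 filter the first variant performs 9 descriptor loads and the
second performs 1, with identical results.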

Marek

