[Mesa-dev] [PATCH v2 00/16] radeonsi: improve handling of temporary arrays

Thu Aug 11 09:58:53 UTC 2016

On 11.08.2016 11:44, Christian König wrote:
> On 11.08.2016 at 11:29, Nicolai Hähnle wrote:
>> On 10.08.2016 23:36, Marek Olšák wrote:
>>> On Wed, Aug 10, 2016 at 9:23 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> this is a respin of the series which scans the shader's TGSI to
>>>> determine which channels of an array are actually written to. Most of
>>>> the st/mesa changes have become unnecessary. Most of the radeon-specific
>>>> part stays the same.
>>>>
>>>> For one F1 2015 shader, it reduces the scratch size from 132096 to
>>>> 26624 bytes, which is bound to be much nicer on the texture cache.
>>>
>>> This has been bugging me... is there something we can do to move
>>> temporary arrays to registers?
>>>
>>> F1 2015 is the only game that doesn't "spill VGPRs", yet has the
>>> highest scratch usage per shader (without this series).
>>>
>>> If a shader uses 32 VGPRs and a *ton* of scratch space, you know
>>> something is wrong.
>>
>> We actually already do that partially: in emit_declaration, we check
>> the size of the array, and if it's below a certain threshold (<= 16
>> currently) it is lowered to LLVM IR that becomes registers. In
>> particular, that one shader has:
>>
>> Before: Shader Stats: SGPRS: 40 VGPRS: 32 Code Size: 3316 LDS: 0
>> Scratch: 132096 Max Waves: 8 Spilled SGPRs: 0 Spilled VGPRs: 0
>> After: Shader Stats: SGPRS: 32 VGPRS: 60 Code Size: 3068 LDS: 0
>> Scratch: 26624 Max Waves: 4 Spilled SGPRs: 0 Spilled VGPRs: 0
>>
>> Looks like some of the arrays now land in VGPRs since they have become
>> smaller with that series.
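
The Max Waves figures in those stats are consistent with the VGPR counts alone. A minimal Python sketch, assuming 256 VGPRs per SIMD lane, a limit of 10 waves per SIMD, and VGPR allocation in steps of 4 registers. Those three constants are assumptions for this sketch, not something the stats output states, and the function name is made up for illustration. SGPR and LDS limits are ignored.

```python
def max_waves_from_vgprs(vgprs, vgprs_per_simd=256, wave_limit=10, granularity=4):
    """Estimate waves per SIMD from the VGPR count only.

    All default values are assumptions of this sketch; SGPR and LDS
    limits, which can also cap occupancy, are not modelled.
    """
    # Round the VGPR count up to the allocation granularity (at least one step).
    allocated = max(granularity, -(-vgprs // granularity) * granularity)
    # Each wave needs its own copy of the allocated registers.
    return min(wave_limit, vgprs_per_simd // allocated)


before = max_waves_from_vgprs(32)  # VGPRS: 32 in the "Before" stats
after = max_waves_from_vgprs(60)   # VGPRS: 60 in the "After" stats
```

Under these assumptions, 32 VGPRs give 8 waves and 60 VGPRs give 4 waves, which matches the two stat lines: the smaller scratch footprint is paid for with half the occupancy.
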
>>
>> There are still a _lot_ of weaknesses in all of this, and they mostly
>> have to do with limitations that are rather deeply baked into
>> assumptions of LLVM's codegen architecture.
>>
>> The biggest problem is that an array in VGPRs needs to be represented
>> somehow in the codegen, and it is currently being represented as one
>> of the VGPR vector register classes, which go up to VReg_512, i.e. 16
>> registers. Two problems with that:
>>
>> 1. The granularity sucks. If you have an array of 10 entries, it'll
>> end up effectively using 16 registers anyway.
>>
>> 2. You can't go above arrays of size 16. (Though to be fair, once you
>> reach that size, you should probably start worrying about VGPR pressure.)
>>
>> Some other issues are that
>>
>> 3. It should really be LLVM that decides how to lower an array, not
>> Mesa. Ideally, LLVM should be able to make an intelligent decision
>> based on the overall register pressure.
>>
>> 4. We currently don't use LDS for arrays in shaders. This was disabled
>> because LLVM needs to be taught about interactions with other LDS uses,
>> especially in tessellation.
>>
>> I think fixing point 4 is the thing with the highest impact/effort
>> ratio right now.
>>
>> For point 3, perhaps we could actually extend the alloca lowering even
>> further so that it lowers allocas into VGPRs
>> _after_register_allocation_. But there's a whole can of worms
>> associated with this.
>>
>> (Oh, another thing to keep in mind: we cannot do non-uniform relative
>> indexing of VGPR arrays. This is emulated by a loop in the shader. So
>> depending on the access patterns into arrays, LDS or in extreme cases
>> even scratch space can actually be faster than VGPR.)
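
To illustrate why non-uniform indexing is expensive, here is a small Python simulation of one common way to emulate it with a loop: pick the index of the first still-active lane, treat it as uniform, serve every lane that wants the same index, and repeat. This is a sketch of the general idea only, not necessarily the exact sequence the backend emits, and all names are hypothetical.

```python
def waterfall_read(array_regs, lane_indices):
    """Simulate a per-lane indexed read of a register array.

    array_regs[r][lane] is the value of register r in the given lane.
    lane_indices[lane] is the (possibly non-uniform) index that lane wants.
    Returns the per-lane results and the number of loop iterations.
    """
    num_lanes = len(lane_indices)
    result = [None] * num_lanes
    active = list(range(num_lanes))
    iterations = 0
    while active:
        # Take the index of the first active lane and use it as a uniform index.
        uniform_idx = lane_indices[active[0]]
        # Every active lane that asked for this index is served in this pass.
        for lane in active:
            if lane_indices[lane] == uniform_idx:
                result[lane] = array_regs[uniform_idx][lane]
        # Lanes that were served drop out; the rest go around again.
        active = [lane for lane in active if lane_indices[lane] != uniform_idx]
        iterations += 1
    return result, iterations


# Toy example: 4 lanes, an array of 3 registers, register r holds r*10 + lane.
regs = [[r * 10 + lane for lane in range(4)] for r in range(3)]
values, passes = waterfall_read(regs, [2, 0, 2, 1])
```

The loop runs once per distinct index in the wave, so a uniform index costs one pass while fully divergent indices cost one pass per lane. That is the case where LDS or scratch can come out ahead.
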
>>
>> Points 1 and 2 are a serious headache. I'm not deep enough into all the
>> GlobalISel work going on in LLVM, though I've read some things that
>> make me hopeful that CodeGen based on GlobalISel could help (because
>> it generally makes the process of register assignment more flexible
>> and configurable).
>
> When I initially implemented the support for arrays in radeonsi, one of
> the fundamental problems was that LLVM couldn't handle anything other
> than power-of-two vectors in its instruction selection.
>
> I looked a bit into fixing this, but never completed it. Sounds like
> nobody has worked on this since then, and yes, I completely agree with
> your points.

I actually tried to add vec3 at some point since it's not uncommon for 
sample instructions and vertex shader inputs, but also gave up.

We will also eventually need things like vec5 for sparse textures (aka
PRT), and I think a vec6 for some sample instruction variants, and so on.

The current representation of physical registers in LLVM just isn't well 
suited for that kind of thing, so it ends up being a surprisingly big 
refactoring project.
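
To make the padding cost concrete, here is a minimal Python sketch. It assumes that only power-of-two widths up to 16 registers (the VReg_512 limit mentioned above) are available. The width table and the function names are made up for illustration and are not taken from LLVM.

```python
# Assumption of this sketch: only these vector widths (in 32-bit VGPRs) exist.
VECTOR_WIDTHS = (1, 2, 4, 8, 16)


def padded_width(num_regs):
    """Return the smallest available width that can hold num_regs values.

    Returns None if the value does not fit in the largest width.
    """
    if num_regs < 1:
        raise ValueError("num_regs must be at least 1")
    for width in VECTOR_WIDTHS:
        if width >= num_regs:
            return width
    return None


def wasted_registers(num_regs):
    """Return how many registers are lost to padding, or None if it does not fit."""
    width = padded_width(num_regs)
    return None if width is None else width - num_regs
```

With that table, a vec3 occupies 4 registers, vec5 and vec6 both occupy 8, a 10-entry array occupies 16 (6 wasted), and anything above 16 entries has no register representation at all.
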

Nicolai

>
> Regards,
> Christian.
>
>>
>> Cheers,
>> Nicolai