[Mesa-dev] [PATCH v2 00/16] radeonsi: improve handling of temporary arrays

Thu Aug 11 10:57:14 UTC 2016

On Thu, Aug 11, 2016 at 11:29 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 10.08.2016 23:36, Marek Olšák wrote:
>>
>> On Wed, Aug 10, 2016 at 9:23 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> this is a respin of the series which scans the shader's TGSI to determine
>>> which channels of an array are actually written to. Most of the st/mesa
>>> changes have become unnecessary. Most of the radeon-specific part stays
>>> the same.
>>>
>>> For one F1 2015 shader, it reduces the scratch size from 132096 to 26624
>>> bytes, which is bound to be much nicer on the texture cache.
>>
>>
>> This has been bugging me... is there something we can do to move
>> temporary arrays to registers?
>>
>> F1 2015 is the only game that doesn't "spill VGPRs", yet has the
>> highest scratch usage per shader. (without this series)
>>
>> If a shader uses 32 VGPRs and a *ton* of scratch space, you know
>> something is wrong.
>
>
> We actually already do that partially: in emit_declaration, we check the
> size of the array, and if it's below a certain threshold (<= 16 currently)
> it is lowered to LLVM IR that becomes registers. In particular, that one
> shader has:
>
> Before: Shader Stats: SGPRS: 40 VGPRS: 32 Code Size: 3316 LDS: 0 Scratch:
> 132096 Max Waves: 8 Spilled SGPRs: 0 Spilled VGPRs: 0
> After: Shader Stats: SGPRS: 32 VGPRS: 60 Code Size: 3068 LDS: 0 Scratch:
> 26624 Max Waves: 4 Spilled SGPRs: 0 Spilled VGPRs: 0
>
> Looks like some of the arrays now land in VGPRs since they have become
> smaller with that series.
>
> There are still a _lot_ of weaknesses in all of this, and they mostly have
> to do with limitations that are rather deeply baked into assumptions of
> LLVM's codegen architecture.
>
> The biggest problem is that an array in VGPRs needs to be represented
> somehow in the codegen, and it is currently being represented as one of the
> VGPR vector register classes, which go up to VReg_512, i.e. 16 registers.
> Two problems with that:
>
> 1. The granularity sucks. If you have an array of 10 entries, it'll end up
> effectively using 16 registers anyway.
>
> 2. You can't go above arrays of size 16. (Though to be fair, once you reach
> that size, you should probably start worrying about VGPR pressure.)
>
> Some other issues are that
>
> 3. It should really be LLVM that decides how to lower an array, not Mesa.
> Ideally, LLVM should be able to make an intelligent decision based on the
> overall register pressure.
>
> 4. We currently don't use LDS for shaders. This was disabled because LLVM
> needs to be taught about interactions with other LDS uses, especially in
> tessellation.
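
To make points 1 and 2 above concrete, here is a rough sketch of the rounding effect. The list of VGPR tuple sizes is an assumption for illustration (only the 16-register maximum, VReg_512, is stated above), and `vgprs_for_array` is a hypothetical helper, not anything in Mesa or LLVM.

```python
# Assumed VGPR register-class sizes, in registers. Only the maximum of 16
# (VReg_512) comes from the discussion above; the smaller sizes are
# illustrative assumptions.
VGPR_CLASS_SIZES = (1, 2, 3, 4, 8, 16)


def vgprs_for_array(num_entries):
    """Return how many VGPRs an array of num_entries would occupy if it is
    placed in the smallest register class that fits, or None if it is too
    large for any class and therefore cannot live in registers this way."""
    for size in VGPR_CLASS_SIZES:
        if num_entries <= size:
            return size
    return None


if __name__ == "__main__":
    for n in (4, 5, 10, 16, 17):
        print(n, vgprs_for_array(n))
```

Under these assumptions a 10-entry array occupies 16 registers, and anything above 16 entries gets no register class at all.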

I think we should first focus on PS and CS. Sadly, LDS is pretty
small. We can spill at most 128 dwords (256 on CIK/VI) per thread, but
at that point all of LDS is used and the wave count is 1 per CU (0.25
per SIMD), which is worse than scratch. A more conservative approach is
to allow at most 16 dwords (32 on CIK/VI) of LDS per thread, which
should give us 2 waves per SIMD (with zero PS inputs and no DDX/DDY) or
1-2 waves per SIMD depending on the number of PS inputs (but never 2 on
all SIMDs).
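
The arithmetic behind those figures can be sketched roughly as follows. The 64-thread wave size and the usable LDS budgets of 32 KiB (SI) and 64 KiB (CIK/VI) are assumptions inferred from the numbers above, and `waves_per_simd` is a hypothetical helper. The sketch ignores LDS consumed by PS inputs and DDX/DDY, so it only reproduces the best case.

```python
WAVE_SIZE = 64        # threads per wave (assumption)
BYTES_PER_DWORD = 4
SIMDS_PER_CU = 4      # implied above by "1 per CU (0.25 per SIMD)"

# Assumed usable LDS budgets, inferred from "all LDS is used" at
# 128 dwords/thread (SI) and 256 dwords/thread (CIK/VI).
LDS_BYTES_SI = 32 * 1024
LDS_BYTES_CIK_VI = 64 * 1024


def waves_per_simd(dwords_per_thread, lds_bytes):
    """Best-case wave count per SIMD when LDS is the only limiting factor."""
    bytes_per_wave = dwords_per_thread * BYTES_PER_DWORD * WAVE_SIZE
    waves_per_cu = lds_bytes // bytes_per_wave
    return waves_per_cu / SIMDS_PER_CU


if __name__ == "__main__":
    print(waves_per_simd(128, LDS_BYTES_SI))      # maximum spill, SI
    print(waves_per_simd(256, LDS_BYTES_CIK_VI))  # maximum spill, CIK/VI
    print(waves_per_simd(16, LDS_BYTES_SI))       # conservative, SI
    print(waves_per_simd(32, LDS_BYTES_CIK_VI))   # conservative, CIK/VI
```

Any LDS taken by PS inputs shrinks the budget passed in as `lds_bytes`, which is why the real figure drops to 1-2 waves per SIMD.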

Marek