[Mesa-dev] [PATCH v2 00/16] radeonsi: improve handling of temporary arrays

Fri Aug 12 12:23:35 UTC 2016

On 11.08.2016 12:57, Marek Olšák wrote:
> On Thu, Aug 11, 2016 at 11:29 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> On 10.08.2016 23:36, Marek Olšák wrote:
>>>
>>> On Wed, Aug 10, 2016 at 9:23 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> this is a respin of the series which scans the shader's TGSI to determine
>>>> which channels of an array are actually written to. Most of the st/mesa
>>>> changes have become unnecessary. Most of the radeon-specific part stays
>>>> the same.
>>>>
>>>> For one F1 2015 shader, it reduces the scratch size from 132096 to 26624
>>>> bytes, which is bound to be much nicer on the texture cache.
>>>
>>>
>>> This has been bugging me... is there something we can do to move
>>> temporary arrays to registers?
>>>
>>> F1 2015 is the only game that doesn't "spill VGPRs", yet has the
>>> highest scratch usage per shader. (without this series)
>>>
>>> If a shader uses 32 VGPRs and a *ton* of scratch space, you know
>>> something is wrong.
>>
>>
>> We actually already do that partially: in emit_declaration, we check the
>> size of the array, and if it's below a certain threshold (<= 16 currently)
>> it is lowered to LLVM IR that becomes registers. In particular, that one
>> shader has:
>>
>> Before: Shader Stats: SGPRS: 40 VGPRS: 32 Code Size: 3316 LDS: 0 Scratch:
>> 132096 Max Waves: 8 Spilled SGPRs: 0 Spilled VGPRs: 0
>> After: Shader Stats: SGPRS: 32 VGPRS: 60 Code Size: 3068 LDS: 0 Scratch:
>> 26624 Max Waves: 4 Spilled SGPRs: 0 Spilled VGPRs: 0
>>
>> Looks like some of the arrays now land in VGPRs since they have become
>> smaller with that series.
>>
>> There are still a _lot_ of weaknesses in all of this, and they mostly have
>> to do with limitations that are rather deeply baked into assumptions of
>> LLVM's codegen architecture.
>>
>> The biggest problem is that an array in VGPRs needs to be represented
>> somehow in the codegen, and it is currently being represented as one of the
>> VGPR vector register classes, which go up to VReg_512, i.e. 16 registers.
>> Two problems with that:
>>
>> 1. The granularity sucks. If you have an array of 10 entries, it'll end up
>> effectively using 16 registers anyway.
>>
>> 2. You can't go above arrays of size 16. (Though to be fair, once you reach
>> that size, you should probably start worrying about VGPR pressure.)
>>
>> Some other issues are that
>>
>> 3. It should really be LLVM that decides how to lower an array, not Mesa.
>> Ideally, LLVM should be able to make an intelligent decision based on the
>> overall register pressure.
>>
>> 4. We currently don't use LDS for shaders. This was disabled because LLVM
>> needs to be taught about interactions with other LDS uses, especially in
>> tessellation.
>
> I think we should first focus on PS and CS. Sadly, LDS is pretty
> small. We can spill at most 128 dwords (256 on CIK/VI) per thread, but
> all LDS is used at that point and the wave count is 1 per CU (0.25 per
> SIMD) = worse than scratch. A more conservative approach is to have a
> maximum of 16 (32 - CIK/VI) dwords of LDS per thread, which should
> give us 2 waves per SIMD (with zero PS inputs and no DDX/DDY) or 1-2
> waves per SIMD depending on the number of PS inputs (but never 2 on
> all SIMDs).

You're right, I got confused and did my calculation assuming the LDS was
per-SIMD. You basically want an LDS usage per thread that corresponds to a
bit less than 1/8 (SI) or 1/4 (CIK/VI) of the number of VGPRs. BTW, I
think VI can do DDX/DDY without actually using LDS memory (via the
ds_permute instructions). Anyway, LDS being per-CU makes it much less
effective.
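
For reference, the arithmetic behind those ratios can be sketched roughly
as below. This is only a back-of-the-envelope model, not anything taken
from Mesa or LLVM: the constants (64 threads per wave, 4 SIMDs per CU,
256 VGPRs per thread at one wave per SIMD, 32 KiB of usable LDS per CU on
SI and 64 KiB on CIK/VI) are assumptions inferred from the numbers quoted
in this thread, and the function names are made up for illustration.

```python
# Assumed hardware figures, inferred from the numbers in this thread.
WAVE_SIZE = 64          # threads per wave
SIMDS_PER_CU = 4        # implied by "1 per CU (0.25 per SIMD)"
VGPRS_PER_THREAD = 256  # per-thread VGPR budget at one wave per SIMD
DWORD_BYTES = 4

# Usable LDS per CU in bytes, back-computed from
# "128 dwords (256 on CIK/VI) per thread" at one wave per CU.
LDS_BYTES_PER_CU = {"SI": 32 * 1024, "CIK/VI": 64 * 1024}


def lds_waves_per_cu(gen, lds_dwords_per_thread):
    """Waves that fit on one CU if LDS is the only limiting resource."""
    bytes_per_wave = lds_dwords_per_thread * DWORD_BYTES * WAVE_SIZE
    return LDS_BYTES_PER_CU[gen] // bytes_per_wave


def lds_dwords_per_thread(gen, waves_per_simd):
    """Per-thread LDS budget (in dwords) at a given occupancy."""
    waves_per_cu = waves_per_simd * SIMDS_PER_CU
    return LDS_BYTES_PER_CU[gen] // (waves_per_cu * WAVE_SIZE * DWORD_BYTES)


def vgprs_per_thread(waves_per_simd):
    """Per-thread VGPR budget at the same occupancy."""
    return VGPRS_PER_THREAD // waves_per_simd


if __name__ == "__main__":
    for gen in ("SI", "CIK/VI"):
        for waves in (1, 2, 4, 8):
            print(gen, waves, lds_dwords_per_thread(gen, waves),
                  vgprs_per_thread(waves))
```

Under these assumptions the LDS budget is a fixed 1/8 (SI) or 1/4
(CIK/VI) of the VGPR budget at every occupancy. The "bit less" comes from
whatever LDS is already taken by PS inputs and DDX/DDY, which this sketch
ignores.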

Nicolai

> Marek
>