[Mesa-dev] [PATCH] nouveau: codegen: combineLd/St do not combine indirect loads

Ilia Mirkin imirkin at alum.mit.edu
Thu Apr 21 15:52:52 UTC 2016


On Thu, Apr 21, 2016 at 11:40 AM, Hans de Goede <hdegoede at redhat.com> wrote:
> Hi,
>
>
> On 21-04-16 17:09, Samuel Pitoiset wrote:
>>
>>
>>
>> On 04/21/2016 04:46 PM, Hans de Goede wrote:
>>>
>>> Hi,
>>>
>>> On 21-04-16 16:28, Ilia Mirkin wrote:
>>>>
>>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdegoede at redhat.com>
>>>> wrote:
>>>>>
>>>>> combineLd/St would combine, i.e. :
>>>>>
>>>>> st  u32 # g[$r2+0x0] $r2
>>>>> st  u32 # g[$r2+0x4] $r3
>>>>>
>>>>> into:
>>>>>
>>>>> st  u64 # g[$r2+0x0] $r2d
>>>>>
>>>>> But this is only valid if $r2 contains an 8-byte aligned address,
>>>>> which is not known to be the case.
>>>>>
>>>>> This commit checks for src0 dim 0 not being indirect when combining
>>>>> loads / stores as combining indirect loads / stores may break alignment
>>>>> rules.
>>>>
>>>>
>>>> I believe the assumption is that all indirect addresses are 16-byte
>>>> aligned. This works out for GL, I think. Although hm... I wonder what
>>>> happens if you have a
>>>>
>>>> layout (std430) buffer foo {
>>>>    int x[16];
>>>> }
>>>>
>>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
>>>> of size-based validation which might avoid the problem.
>>>>
>>>> My concern is that now constbufs will get the same treatment, and for
>>>> constbufs the alignment is always 16 :(
>>>>
>>>> What do you think? Just drop those, or add extra conditionals to allow
>>>> it for constbufs?
>>>
>>>
>>> I'm not sure we have the alignment guarantee for constbufs. IIRC we lower
>>> const buf accesses to be indirect because we want to provide more than
>>> 8 UBOs,
>>> right? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
>>> with e.g.:
>>
>>
>> Right. This is because the launch descriptor used for compute shaders on
>> Kepler only allows setting up 8 CBs, but OpenGL requires at least 14 UBOs,
>> so the logic is to stick the UBOs' information into the driver constant
>> buffer.
>>
>> As you can see, we do this dance for all UBOs because it's simpler than
>> testing whether a UBO has been described in the launch descriptor or not
>> (i.e. whether it's mapped as c1[], c2[], etc.).
>>
>> The lowering pass should properly handle indirect UBO accesses (I did
>> write a piglit test for that and looked at the blob). But I'm not sure if
>> we can break alignment here.
>>
>> Do you have a simple shader that might hit the issue?
>
>
> I'm definitely hitting the issue with OpenCL programs,
> specifically with:
>
> piglit/tests/cl/program/execute/get-num-groups.cl
>
> Which contains:
>
> kernel void fill3d(global int* out) {
>         unsigned int id = get_global_id(0) +
>                           get_global_size(0) * get_global_id(1);
>         out[3*id] = get_num_groups(0);
>         out[3*id+1] = get_num_groups(1);
>         out[3*id+2] = get_num_groups(2);
> }
>
> Notice the 3 * id: we end up combining the
> get_num_groups(0) and get_num_groups(1)
> stores into a single 64-bit store, which for
> (id % 2 == 1) results in an unaligned trap
> on the GPU.
>
> Interestingly enough this is the only piglit CL
> test which triggers this, but it is still a real
> problem AFAICT.
>
> Note this gets translated into:
>
> COMP
> DCL SV[0], BLOCK_ID
> DCL SV[1], BLOCK_SIZE
> DCL SV[2], GRID_SIZE
> DCL SV[3], THREAD_ID
> DCL MEMORY[0], GLOBAL
> DCL MEMORY[1], SHARED
> DCL MEMORY[2], PRIVATE
> DCL MEMORY[3], INPUT
>
> IMM[0] UINT32 {2, 0, 0, 0}
> IMM[1] UINT32 {0, 0, 0, 0}
>   0: BGNSUB :0
>   1:   UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
>   2:   UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
>   3:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
>   4:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>   5:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>   6:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>   7:   RET
>   8: ENDSUB
> IMM[2] UINT32 {3, 0, 0, 0}
> IMM[3] UINT32 {4, 0, 0, 0}
>   9: BGNSUB :0
>  10:   UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
>  11:   UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
>  12:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>  13:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>  14:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>  15:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>  16:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
>  17:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>  18:   UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
>  19:   STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
>  20:   OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
>  21:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>  22:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
>  23:   RET
>  24: ENDSUB
> IMM[4] UINT32 {12, 0, 0, 0}
> IMM[5] UINT32 {8, 0, 0, 0}
>  25: BGNSUB :0
>  26:   UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
>  27:   UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
>  28:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
>  29:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
>  30:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
>  31:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
>  32:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>  33:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>  34:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>  35:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>  36:   UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
>  37:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>  38:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>  39:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>  40:   UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
>  41:   STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
>  42:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
>  43:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
>  44:   RET
>  45: ENDSUB
>
> With the SUB beginning at 25: corresponding to
> the troublesome fill3d function.
>
> With my fix the generated code for this is:
>
>   0: rdsv u32 $r0 sv[CTAID:2] (8)
>   1: rdsv u32 $r1 sv[TID:2] (8)
>   2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
>   3: ld  u64 $r0d c7[0xf0] (8)
>   4: rdsv u32 $r3 sv[CTAID:1] (8)
>   5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
>   6: rdsv u32 $r3 sv[TID:1] (8)
>   7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
>   8: ld  u32 $r3 c7[0xec] (8)
>   9: rdsv u32 $r4 sv[CTAID:0] (8)
>  10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
>  11: rdsv u32 $r4 sv[TID:0] (8)
>  12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
>  13: mov u32 $r4 0x0000000c (8)
>  14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
>  15: st  u32 # g[$r2+0x0] $r3 (8)
>  16: add u32 $r3 $r2 0x00000004 (8)
>  17: st  u32 # g[$r2+0x4] $r0 (8)
>  18: add u32 $r0 $r2 0x00000008 (8)
>  19: st  u32 # g[$r2+0x8] $r1 (8)
>  20: ret (8)
>
> Notice that this code also seems to hit
> another bug: instructions 16 and 18
> got folded into the "st" instructions as offsets,
> but they did not get deleted.

Something must still be using them, so they don't get DCE'd. Probably
a screwup somewhere? There's a special folding pass which merges such
small offsets in...

>
> Any clues for where to start looking at
> the root cause of that are welcome.
>
> ###
>
> As for Ilia's suggestion to keep allowing
> combining of indirect loads for constbufs:
> given the discussion that seems sensible,
> at least for UBOs. For OpenCL, the
> input parameters may end up being indirectly
> accessed in an unaligned manner too.

Can OpenCL end up using a UBO? If so, we should just key this on
compute, i.e. allow merging of indirect constbuf loads in non-compute
shaders. That might be the quickest and simplest thing to do regardless.
I doubt this case is frequent enough to merit more worrying
about.
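The compromise discussed here could look roughly like the following predicate (a hedged sketch; the function and parameter names are illustrative, not the actual nv50_ir API): merge direct-offset accesses as before, and only allow merging indirect accesses when they target a constant buffer outside a compute shader, where GL's 16-byte alignment assumption holds.

```c
#include <stdbool.h>

/* Hypothetical guard for the combine pass, sketching the compromise
 * above: direct offsets have statically known alignment; indirect
 * accesses are only merged for constbufs in non-compute shaders
 * (GL constbuf addresses are assumed 16-byte aligned there). */
static bool may_combine_indirect(bool is_indirect, bool is_constbuf,
                                 bool is_compute)
{
    if (!is_indirect)
        return true;                   /* direct offset: alignment is known */
    return is_constbuf && !is_compute; /* indirect: only GL constbufs */
}
```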

  -ilia

