[Mesa-dev] [PATCH] nouveau: codegen: combineLd/St do not combine indirect loads

Thu Apr 21 15:40:09 UTC 2016

Hi,

On 21-04-16 17:09, Samuel Pitoiset wrote:
>
>
> On 04/21/2016 04:46 PM, Hans de Goede wrote:
>> Hi,
>>
>> On 21-04-16 16:28, Ilia Mirkin wrote:
>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdegoede at redhat.com>
>>> wrote:
>>>> combineLd/St would combine, i.e. :
>>>>
>>>> st  u32 # g[$r2+0x0] $r2
>>>> st  u32 # g[$r2+0x4] $r3
>>>>
>>>> into:
>>>>
>>>> st  u64 # g[$r2+0x0] $r2d
>>>>
>>>> But this is only valid if r2 contains an 8 byte aligned address,
>>>> which is unknown.
>>>>
>>>> This commit checks for src0 dim 0 not being indirect when combining
>>>> loads / stores as combining indirect loads / stores may break alignment
>>>> rules.
>>>
>>> I believe the assumption is that all indirect addresses are 16-byte
>>> aligned. This works out for GL, I think. Although hm... I wonder what
>>> happens if you have a
>>>
>>> layout (std430) buffer foo {
>>>    int x[16];
>>> }
>>>
>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
>>> of size-based validation which might avoid the problem.
>>>
>>> My concern is that now constbufs will get the same treatment, and for
>>> constbufs the alignment is always 16 :(
>>>
>>> What do you think? Just drop those, or add extra conditionals to allow
>>> it for constbufs?
>>
>> I'm not sure we've the alignment guarantee for constbufs, IIRC we lower
>> const buf accesses to be indirect because we want to provide more then 8
>> UBO-s,
>> right ? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
>> with e.g.:
>
> Right. This is because the launch descriptor used for compute shaders on kepler only allows to set up 8 CBs. But OpenGL requires at least 14 UBOs, so the logic is to stick UBOs' information into the driver constant buffer.
>
> As you can, we do this dance for all UBOs because it's simpler that testing if an UBO has been described in the launch descriptor or not (so if it's mapped as c1[], c2[], etc).
>
> The lowering pass should properly handle indirect UBO accesses (I did write a piglit test for that and looked at blob). But I'm not sure if we can break alignment here.
>
> Do you have a simple shader that might hit the issue?

I'm definitely hitting the issue with opencl programs,
specifically with:

piglit/tests/cl/program/execute/get-num-groups.cl

Which contains:

kernel void fill3d(global int* out) {
         unsigned int id =  get_global_id(0) + get_global_size(0)*get_global_id(1
         out[3*id] = get_num_groups(0);
         out[3*id+1] = get_num_groups(1);
         out[3*id+2] = get_num_groups(2);
}

Notice the 3 * id, we end up combining
get_num_groups(0) and get_num_groups(1)
into a single 64 bit store, which for
(id % 2 == 1) results in an unaligned trap
on the gpu.

Interestingly enough this is the only piglet cl
test which triggers this, but still this is a real
problem AFAICT.

Note this gets translated into:

COMP
DCL SV[0], BLOCK_ID
DCL SV[1], BLOCK_SIZE
DCL SV[2], GRID_SIZE
DCL SV[3], THREAD_ID
DCL MEMORY[0], GLOBAL
DCL MEMORY[1], SHARED
DCL MEMORY[2], PRIVATE
DCL MEMORY[3], INPUT

IMM[0] UINT32 {2, 0, 0, 0}
IMM[1] UINT32 {0, 0, 0, 0}
   0: BGNSUB :0
   1:   UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
   2:   UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
   3:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
   4:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
   5:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
   6:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
   7:   RET
   8: ENDSUB
IMM[2] UINT32 {3, 0, 0, 0}
IMM[3] UINT32 {4, 0, 0, 0}
   9: BGNSUB :0
  10:   UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
  11:   UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
  12:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
  13:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
  14:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
  15:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
  16:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
  17:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
  18:   UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
  19:   STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
  20:   OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
  21:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
  22:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
  23:   RET
  24: ENDSUB
IMM[4] UINT32 {12, 0, 0, 0}
IMM[5] UINT32 {8, 0, 0, 0}
  25: BGNSUB :0
  26:   UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
  27:   UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
  28:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
  29:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
  30:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
  31:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
  32:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
  33:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
  34:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
  35:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
  36:   UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
  37:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
  38:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
  39:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
  40:   UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
  41:   STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
  42:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
  43:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
  44:   RET
  45: ENDSUB

With the SUB beginning at 25: corresponding to
the troublesome fill3d function.

With my fix the generated code for this is:

   0: rdsv u32 $r0 sv[CTAID:2] (8)
   1: rdsv u32 $r1 sv[TID:2] (8)
   2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
   3: ld  u64 $r0d c7[0xf0] (8)
   4: rdsv u32 $r3 sv[CTAID:1] (8)
   5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
   6: rdsv u32 $r3 sv[TID:1] (8)
   7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
   8: ld  u32 $r3 c7[0xec] (8)
   9: rdsv u32 $r4 sv[CTAID:0] (8)
  10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
  11: rdsv u32 $r4 sv[TID:0] (8)
  12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
  13: mov u32 $r4 0x0000000c (8)
  14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
  15: st  u32 # g[$r2+0x0] $r3 (8)
  16: add u32 $r3 $r2 0x00000004 (8)
  17: st  u32 # g[$r2+0x4] $r0 (8)
  18: add u32 $r0 $r2 0x00000008 (8)
  19: st  u32 # g[$r2+0x8] $r1 (8)
  20: ret (8)

Notice that this code also seems to hit
another bug, instructions 16 and 18
got folded into the "st" instructions as offset,
but they did not get deleted.

Any clues for where to start looking at
the root cause of that are welcome.

###

As for Ilia's solution to not disallow
combining of indirect loads for constbufs
given the discussion that seems sensible,
at least for ubo-s, for opencl the
input parameters may end up being indirectly
accessed in an unaligned matter too.

Regards,

Hans

>
>>
>> ld u64  r2d c7[r1+0x0]
>>
>> Where r1 contains the offset of the user-buf. But what if the user is
>> somehow
>> indirectly accessing the userbuf, then we will have added that indirect
>> offset
>> to r1, and we can no longer assume that we can safely merge the loads
>> without
>> breaking alignment rules.
>>
>> I hope I'm making sense here, I'm still a bit unsure about the details how
>> this all works.
>>
>> Regards,
>>
>> Hans
>