[Mesa-dev] [PATCH] nouveau: codegen: combineLd/St do not combine indirect loads

Fri Apr 22 11:06:31 UTC 2016

Hi,

On 21-04-16 17:52, Ilia Mirkin wrote:
> On Thu, Apr 21, 2016 at 11:40 AM, Hans de Goede <hdegoede at redhat.com> wrote:
>> Hi,
>>
>>
>> On 21-04-16 17:09, Samuel Pitoiset wrote:
>>>
>>>
>>>
>>> On 04/21/2016 04:46 PM, Hans de Goede wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 21-04-16 16:28, Ilia Mirkin wrote:
>>>>>
>>>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdegoede at redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> combineLd/St would combine, i.e. :
>>>>>>
>>>>>> st  u32 # g[$r2+0x0] $r2
>>>>>> st  u32 # g[$r2+0x4] $r3
>>>>>>
>>>>>> into:
>>>>>>
>>>>>> st  u64 # g[$r2+0x0] $r2d
>>>>>>
>>>>>> But this is only valid if r2 contains an 8 byte aligned address,
>>>>>> which is unknown.
>>>>>>
>>>>>> This commit checks for src0 dim 0 not being indirect when combining
>>>>>> loads / stores as combining indirect loads / stores may break alignment
>>>>>> rules.
>>>>>
>>>>>
>>>>> I believe the assumption is that all indirect addresses are 16-byte
>>>>> aligned. This works out for GL, I think. Although hm... I wonder what
>>>>> happens if you have a
>>>>>
>>>>> layout (std430) buffer foo {
>>>>>     int x[16];
>>>>> }
>>>>>
>>>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
>>>>> of size-based validation which might avoid the problem.
>>>>>
>>>>> My concern is that now constbufs will get the same treatment, and for
>>>>> constbufs the alignment is always 16 :(
>>>>>
>>>>> What do you think? Just drop those, or add extra conditionals to allow
>>>>> it for constbufs?
>>>>
>>>>
>>>> I'm not sure we've the alignment guarantee for constbufs, IIRC we lower
>>>> const buf accesses to be indirect because we want to provide more then 8
>>>> UBO-s,
>>>> right ? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
>>>> with e.g.:
>>>
>>>
>>> Right. This is because the launch descriptor used for compute shaders on
>>> kepler only allows to set up 8 CBs. But OpenGL requires at least 14 UBOs, so
>>> the logic is to stick UBOs' information into the driver constant buffer.
>>>
>>> As you can, we do this dance for all UBOs because it's simpler that
>>> testing if an UBO has been described in the launch descriptor or not (so if
>>> it's mapped as c1[], c2[], etc).
>>>
>>> The lowering pass should properly handle indirect UBO accesses (I did
>>> write a piglit test for that and looked at blob). But I'm not sure if we can
>>> break alignment here.
>>>
>>> Do you have a simple shader that might hit the issue?
>>
>>
>> I'm definitely hitting the issue with opencl programs,
>> specifically with:
>>
>> piglit/tests/cl/program/execute/get-num-groups.cl
>>
>> Which contains:
>>
>> kernel void fill3d(global int* out) {
>>          unsigned int id =  get_global_id(0) +
>> get_global_size(0)*get_global_id(1
>>          out[3*id] = get_num_groups(0);
>>          out[3*id+1] = get_num_groups(1);
>>          out[3*id+2] = get_num_groups(2);
>> }
>>
>> Notice the 3 * id, we end up combining
>> get_num_groups(0) and get_num_groups(1)
>> into a single 64 bit store, which for
>> (id % 2 == 1) results in an unaligned trap
>> on the gpu.
>>
>> Interestingly enough this is the only piglet cl
>> test which triggers this, but still this is a real
>> problem AFAICT.
>>
>> Note this gets translated into:
>>
>> COMP
>> DCL SV[0], BLOCK_ID
>> DCL SV[1], BLOCK_SIZE
>> DCL SV[2], GRID_SIZE
>> DCL SV[3], THREAD_ID
>> DCL MEMORY[0], GLOBAL
>> DCL MEMORY[1], SHARED
>> DCL MEMORY[2], PRIVATE
>> DCL MEMORY[3], INPUT
>>
>> IMM[0] UINT32 {2, 0, 0, 0}
>> IMM[1] UINT32 {0, 0, 0, 0}
>>    0: BGNSUB :0
>>    1:   UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
>>    2:   UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
>>    3:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
>>    4:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>    5:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>    6:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>>    7:   RET
>>    8: ENDSUB
>> IMM[2] UINT32 {3, 0, 0, 0}
>> IMM[3] UINT32 {4, 0, 0, 0}
>>    9: BGNSUB :0
>>   10:   UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
>>   11:   UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
>>   12:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>>   13:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>>   14:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>>   15:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>>   16:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
>>   17:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>   18:   UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
>>   19:   STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
>>   20:   OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
>>   21:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>   22:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
>>   23:   RET
>>   24: ENDSUB
>> IMM[4] UINT32 {12, 0, 0, 0}
>> IMM[5] UINT32 {8, 0, 0, 0}
>>   25: BGNSUB :0
>>   26:   UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
>>   27:   UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
>>   28:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
>>   29:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
>>   30:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
>>   31:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
>>   32:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>>   33:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>>   34:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>>   35:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>>   36:   UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
>>   37:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>   38:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>   39:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>>   40:   UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
>>   41:   STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
>>   42:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
>>   43:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
>>   44:   RET
>>   45: ENDSUB
>>
>> With the SUB beginning at 25: corresponding to
>> the troublesome fill3d function.
>>
>> With my fix the generated code for this is:
>>
>>    0: rdsv u32 $r0 sv[CTAID:2] (8)
>>    1: rdsv u32 $r1 sv[TID:2] (8)
>>    2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
>>    3: ld  u64 $r0d c7[0xf0] (8)
>>    4: rdsv u32 $r3 sv[CTAID:1] (8)
>>    5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
>>    6: rdsv u32 $r3 sv[TID:1] (8)
>>    7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
>>    8: ld  u32 $r3 c7[0xec] (8)
>>    9: rdsv u32 $r4 sv[CTAID:0] (8)
>>   10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
>>   11: rdsv u32 $r4 sv[TID:0] (8)
>>   12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
>>   13: mov u32 $r4 0x0000000c (8)
>>   14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
>>   15: st  u32 # g[$r2+0x0] $r3 (8)
>>   16: add u32 $r3 $r2 0x00000004 (8)
>>   17: st  u32 # g[$r2+0x4] $r0 (8)
>>   18: add u32 $r0 $r2 0x00000008 (8)
>>   19: st  u32 # g[$r2+0x8] $r1 (8)
>>   20: ret (8)
>>
>> Notice that this code also seems to hit
>> another bug, instructions 16 and 18
>> got folded into the "st" instructions as offset,
>> but they did not get deleted.
>
> Something must still be using them and they don't get DCE'd. Probably
> a screwup somewhere? There's a special folding pass which merges such
> small offsets in...

Ok, so I added the following to debug this:

diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
index e62ac06..9ce062e 100644
--- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
+++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
@@ -298,9 +298,22 @@ IndirectPropagation::visit(BasicBlock *bb)
                  !insn->src(1).getImmediate(imm) ||
                  !targ->insnCanLoadOffset(i, s, imm.reg.data.s32))
                 continue;
+            printf("folding indirect add into offset ld src %p ind0 %p add def0 %p %d\n",
+                   i->getSrc(s), i->getIndirect(s, 0), insn->getDef(0), insn->getDef(0)->refCount());
+            for (Value::UseIterator it = insn->getDef(0)->uses.begin();
+                 it != insn->getDef(0)->uses.end(); ++it) {
+               ValueRef *ref = *it;
+               printf("ref value %p, indirect %d %d file %d size %d\n",
+                      ref->get(), ref->indirect[0], ref->indirect[1],
+                      (int)ref->getFile(), (int)ref->getSize());
+               if ((*it)->getInsn())
+                   (*it)->getInsn()->print();
+            }
              i->setIndirect(s, 0, insn->getSrc(0));
              i->setSrc(s, cloneShallow(func, i->getSrc(s)));
              i->src(s).get()->reg.data.offset += imm.reg.data.u32;
+            printf("folded  indirect add into offset ld src %p ind0 %p add def0 %p %d\n",
+                   i->getSrc(s), i->getIndirect(s, 0), insn->getDef(0), insn->getDef(0)->refCount());
           } else if (insn->op == OP_SUB && !isFloatType(insn->dType)) {
              if (insn->src(0).getFile() != targ->nativeFile(FILE_ADDRESS) ||
                  !insn->src(1).getImmediate(imm) ||


And the output is quite interesting:

folding indirect add into offset ld src 0x2dda328 ind0 0x27ca558 add def0 0x27ca558 2
ref value 0x27ca558, indirect -1 -1 file 1 size 4
st  u32 # g[%r70+0x0] %r38 (0)
ref value 0x27ca558, indirect -1 -1 file 1 size 4

Note how the uses unordered_set contains 2 ValueRefs pointing to getDef(0)
of the instruction which is not being eliminated as it should.

The second reference does not have insn set, and is still present after
IndirectPropagation::visit() has done its work (I had a printf there
too in an earlier version of my debug printfs).

I've a feeling I'm getting close to the problem, but I'm not familiar
enough with this bit of the code to figure out where the second ref
comes from, or how te remove it.

>> Any clues for where to start looking at
>> the root cause of that are welcome.
>>
>> ###
>>
>> As for Ilia's solution to not disallow
>> combining of indirect loads for constbufs
>> given the discussion that seems sensible,
>> at least for ubo-s, for opencl the
>> input parameters may end up being indirectly
>> accessed in an unaligned matter too.
>
> Can OpenCL end up using a UBO? If so, we should just do this for
> compute. i.e. allow merging of indirect constbuf loads on non-compute
> shaders. That might be the quickest simplest thing to do irregardless.
> I doubt this is such a frequent case that this merits more worrying
> about.

OK, I will send a v2 which only does not combine indirect loads /
stores for compute.

Regards,

Hans