[Mesa-dev] [PATCH] nouveau: codegen: combineLd/St do not combine indirect loads

Sun Apr 24 21:35:00 UTC 2016

On Fri, Apr 22, 2016 at 7:06 AM, Hans de Goede <hdegoede at redhat.com> wrote:
> Hi,
>
>
> On 21-04-16 17:52, Ilia Mirkin wrote:
>>
>> On Thu, Apr 21, 2016 at 11:40 AM, Hans de Goede <hdegoede at redhat.com>
>> wrote:
>>>
>>> Hi,
>>>
>>>
>>> On 21-04-16 17:09, Samuel Pitoiset wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On 04/21/2016 04:46 PM, Hans de Goede wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 21-04-16 16:28, Ilia Mirkin wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdegoede at redhat.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> combineLd/St would combine, i.e. :
>>>>>>>
>>>>>>> st  u32 # g[$r2+0x0] $r2
>>>>>>> st  u32 # g[$r2+0x4] $r3
>>>>>>>
>>>>>>> into:
>>>>>>>
>>>>>>> st  u64 # g[$r2+0x0] $r2d
>>>>>>>
>>>>>>> But this is only valid if r2 contains an 8 byte aligned address,
>>>>>>> which is unknown.
>>>>>>>
>>>>>>> This commit checks for src0 dim 0 not being indirect when combining
>>>>>>> loads / stores as combining indirect loads / stores may break
>>>>>>> alignment
>>>>>>> rules.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I believe the assumption is that all indirect addresses are 16-byte
>>>>>> aligned. This works out for GL, I think. Although hm... I wonder what
>>>>>> happens if you have a
>>>>>>
>>>>>> layout (std430) buffer foo {
>>>>>>     int x[16];
>>>>>> }
>>>>>>
>>>>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
>>>>>> of size-based validation which might avoid the problem.
>>>>>>
>>>>>> My concern is that now constbufs will get the same treatment, and for
>>>>>> constbufs the alignment is always 16 :(
>>>>>>
>>>>>> What do you think? Just drop those, or add extra conditionals to allow
>>>>>> it for constbufs?
>>>>>
>>>>>
>>>>>
>>>>> I'm not sure we've the alignment guarantee for constbufs, IIRC we lower
>>>>> const buf accesses to be indirect because we want to provide more then
>>>>> 8
>>>>> UBO-s,
>>>>> right ? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
>>>>> with e.g.:
>>>>
>>>>
>>>>
>>>> Right. This is because the launch descriptor used for compute shaders on
>>>> kepler only allows to set up 8 CBs. But OpenGL requires at least 14
>>>> UBOs, so
>>>> the logic is to stick UBOs' information into the driver constant buffer.
>>>>
>>>> As you can, we do this dance for all UBOs because it's simpler that
>>>> testing if an UBO has been described in the launch descriptor or not (so
>>>> if
>>>> it's mapped as c1[], c2[], etc).
>>>>
>>>> The lowering pass should properly handle indirect UBO accesses (I did
>>>> write a piglit test for that and looked at blob). But I'm not sure if we
>>>> can
>>>> break alignment here.
>>>>
>>>> Do you have a simple shader that might hit the issue?
>>>
>>>
>>>
>>> I'm definitely hitting the issue with opencl programs,
>>> specifically with:
>>>
>>> piglit/tests/cl/program/execute/get-num-groups.cl
>>>
>>> Which contains:
>>>
>>> kernel void fill3d(global int* out) {
>>>          unsigned int id =  get_global_id(0) +
>>> get_global_size(0)*get_global_id(1
>>>          out[3*id] = get_num_groups(0);
>>>          out[3*id+1] = get_num_groups(1);
>>>          out[3*id+2] = get_num_groups(2);
>>> }
>>>
>>> Notice the 3 * id, we end up combining
>>> get_num_groups(0) and get_num_groups(1)
>>> into a single 64 bit store, which for
>>> (id % 2 == 1) results in an unaligned trap
>>> on the gpu.
>>>
>>> Interestingly enough this is the only piglet cl
>>> test which triggers this, but still this is a real
>>> problem AFAICT.
>>>
>>> Note this gets translated into:
>>>
>>> COMP
>>> DCL SV[0], BLOCK_ID
>>> DCL SV[1], BLOCK_SIZE
>>> DCL SV[2], GRID_SIZE
>>> DCL SV[3], THREAD_ID
>>> DCL MEMORY[0], GLOBAL
>>> DCL MEMORY[1], SHARED
>>> DCL MEMORY[2], PRIVATE
>>> DCL MEMORY[3], INPUT
>>>
>>> IMM[0] UINT32 {2, 0, 0, 0}
>>> IMM[1] UINT32 {0, 0, 0, 0}
>>>    0: BGNSUB :0
>>>    1:   UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
>>>    2:   UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
>>>    3:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
>>>    4:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>>    5:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>>    6:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>>>    7:   RET
>>>    8: ENDSUB
>>> IMM[2] UINT32 {3, 0, 0, 0}
>>> IMM[3] UINT32 {4, 0, 0, 0}
>>>    9: BGNSUB :0
>>>   10:   UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
>>>   11:   UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
>>>   12:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>>>   13:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>>>   14:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>>>   15:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>>>   16:   SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
>>>   17:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>>   18:   UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
>>>   19:   STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
>>>   20:   OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
>>>   21:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>>   22:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
>>>   23:   RET
>>>   24: ENDSUB
>>> IMM[4] UINT32 {12, 0, 0, 0}
>>> IMM[5] UINT32 {8, 0, 0, 0}
>>>   25: BGNSUB :0
>>>   26:   UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
>>>   27:   UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
>>>   28:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
>>>   29:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
>>>   30:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
>>>   31:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
>>>   32:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
>>>   33:   UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
>>>   34:   UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
>>>   35:   UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
>>>   36:   UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
>>>   37:   LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
>>>   38:   UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
>>>   39:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
>>>   40:   UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
>>>   41:   STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
>>>   42:   UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
>>>   43:   STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
>>>   44:   RET
>>>   45: ENDSUB
>>>
>>> With the SUB beginning at 25: corresponding to
>>> the troublesome fill3d function.
>>>
>>> With my fix the generated code for this is:
>>>
>>>    0: rdsv u32 $r0 sv[CTAID:2] (8)
>>>    1: rdsv u32 $r1 sv[TID:2] (8)
>>>    2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
>>>    3: ld  u64 $r0d c7[0xf0] (8)
>>>    4: rdsv u32 $r3 sv[CTAID:1] (8)
>>>    5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
>>>    6: rdsv u32 $r3 sv[TID:1] (8)
>>>    7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
>>>    8: ld  u32 $r3 c7[0xec] (8)
>>>    9: rdsv u32 $r4 sv[CTAID:0] (8)
>>>   10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
>>>   11: rdsv u32 $r4 sv[TID:0] (8)
>>>   12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
>>>   13: mov u32 $r4 0x0000000c (8)
>>>   14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
>>>   15: st  u32 # g[$r2+0x0] $r3 (8)
>>>   16: add u32 $r3 $r2 0x00000004 (8)
>>>   17: st  u32 # g[$r2+0x4] $r0 (8)
>>>   18: add u32 $r0 $r2 0x00000008 (8)
>>>   19: st  u32 # g[$r2+0x8] $r1 (8)
>>>   20: ret (8)
>>>
>>> Notice that this code also seems to hit
>>> another bug, instructions 16 and 18
>>> got folded into the "st" instructions as offset,
>>> but they did not get deleted.
>>
>>
>> Something must still be using them and they don't get DCE'd. Probably
>> a screwup somewhere? There's a special folding pass which merges such
>> small offsets in...
>
>
> Ok, so I added the following to debug this:
>
> diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
> b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
> index e62ac06..9ce062e 100644
> --- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
> +++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
> @@ -298,9 +298,22 @@ IndirectPropagation::visit(BasicBlock *bb)
>                  !insn->src(1).getImmediate(imm) ||
>                  !targ->insnCanLoadOffset(i, s, imm.reg.data.s32))
>                 continue;
> +            printf("folding indirect add into offset ld src %p ind0 %p add
> def0 %p %d\n",
> +                   i->getSrc(s), i->getIndirect(s, 0), insn->getDef(0),
> insn->getDef(0)->refCount());
> +            for (Value::UseIterator it = insn->getDef(0)->uses.begin();
> +                 it != insn->getDef(0)->uses.end(); ++it) {
> +               ValueRef *ref = *it;
> +               printf("ref value %p, indirect %d %d file %d size %d\n",
> +                      ref->get(), ref->indirect[0], ref->indirect[1],
> +                      (int)ref->getFile(), (int)ref->getSize());
> +               if ((*it)->getInsn())
> +                   (*it)->getInsn()->print();
> +            }
>              i->setIndirect(s, 0, insn->getSrc(0));
>              i->setSrc(s, cloneShallow(func, i->getSrc(s)));
>              i->src(s).get()->reg.data.offset += imm.reg.data.u32;
> +            printf("folded  indirect add into offset ld src %p ind0 %p add
> def0 %p %d\n",
> +                   i->getSrc(s), i->getIndirect(s, 0), insn->getDef(0),
> insn->getDef(0)->refCount());
>           } else if (insn->op == OP_SUB && !isFloatType(insn->dType)) {
>              if (insn->src(0).getFile() != targ->nativeFile(FILE_ADDRESS) ||
>                  !insn->src(1).getImmediate(imm) ||
>
>
> And the output is quite interesting:
>
> folding indirect add into offset ld src 0x2dda328 ind0 0x27ca558 add def0
> 0x27ca558 2
> ref value 0x27ca558, indirect -1 -1 file 1 size 4
> st  u32 # g[%r70+0x0] %r38 (0)
> ref value 0x27ca558, indirect -1 -1 file 1 size 4
>
> Note how the uses unordered_set contains 2 ValueRefs pointing to getDef(0)
> of the instruction which is not being eliminated as it should.
>
> The second reference does not have insn set, and is still present after
> IndirectPropagation::visit() has done its work (I had a printf there
> too in an earlier version of my debug printfs).
>
> I've a feeling I'm getting close to the problem, but I'm not familiar
> enough with this bit of the code to figure out where the second ref
> comes from, or how te remove it.
.
OK, so the issue here is that with TGSI there is no way to determine
what the arguments or return values of a subroutine are. So nouveau
figures it out implicitly - have a look at the
Converter::BindArgumentsPass. It basically just takes all the
registers that are read in, and makes them arguments, and then
everything that's written are the return values. As such, it
determines that those offset addresses are return values, and that's
why you see

SUB:25 (out %r76 %r72)
BB:0 (20 instructions) - df = { }
 -> BB:1 (cross)
  0: rdsv u32 %r34 sv[CTAID:2] (0)
  1: rdsv u32 %r36 sv[TID:2] (0)
  2: mad u32 %r37 %r34 c15[0x8] %r36 (0)
  3: ld  u64 { %r39 %r78 } c15[0x10] (0)
  4: rdsv u32 %r42 sv[CTAID:1] (0)
  5: mad u32 %r43 %r37 c15[0x10] %r42 (0)
  6: rdsv u32 %r48 sv[TID:1] (0)
  7: mad u32 %r49 %r43 c15[0x4] %r48 (0)
  8: ld  u32 %r51 c15[0xc] (0)
  9: rdsv u32 %r54 sv[CTAID:0] (0)
 10: mad u32 %r55 %r49 c15[0xc] %r54 (0)
 11: rdsv u32 %r60 sv[TID:0] (0)
 12: mad u32 %r61 %r55 c15[0x0] %r60 (0)
 13: mov u32 %r63 0x0000000c (0)
 14: mad u32 %r68 %r61 %r63 c0[0x4] (0)
 15: add u32 %r72 %r68 0x00000004 (0)
 16: st  u64 # g[%r68+0x0] %r51 %r39 (0)
 17: add u32 %r76 %r68 0x00000008 (0)
 18: st  u32 # g[%r68+0x8] %r78 (0)
 19: ret (0)
BB:1 (0 instructions) - idom = BB:0, df = { }

Note that the outs are %r72 and %r76. (Because they happen to end up
being written last into TEMP[1].xy .) Not a great system, but within
the confines of TGSI, pretty much all you can do. Adding an extension
which adds a calling convention onto subroutines would fix this issue.

  -ilia