[Mesa-dev] [PATCH] nouveau: codegen: combineLd/St do not combine indirect loads
Hans de Goede
hdegoede at redhat.com
Thu Apr 21 15:40:09 UTC 2016
Hi,
On 21-04-16 17:09, Samuel Pitoiset wrote:
>
>
> On 04/21/2016 04:46 PM, Hans de Goede wrote:
>> Hi,
>>
>> On 21-04-16 16:28, Ilia Mirkin wrote:
>>> On Thu, Apr 21, 2016 at 9:55 AM, Hans de Goede <hdegoede at redhat.com>
>>> wrote:
>>>> combineLd/St would combine, i.e. :
>>>>
>>>> st u32 # g[$r2+0x0] $r2
>>>> st u32 # g[$r2+0x4] $r3
>>>>
>>>> into:
>>>>
>>>> st u64 # g[$r2+0x0] $r2d
>>>>
>>>> But this is only valid if r2 contains an 8 byte aligned address,
>>>> which is unknown.
>>>>
>>>> This commit checks for src0 dim 0 not being indirect when combining
>>>> loads / stores as combining indirect loads / stores may break alignment
>>>> rules.
>>>
>>> I believe the assumption is that all indirect addresses are 16-byte
>>> aligned. This works out for GL, I think. Although hm... I wonder what
>>> happens if you have a
>>>
>>> layout (std430) buffer foo {
>>> int x[16];
>>> }
>>>
>>> And you access x[i], x[i+1], and i == 1. I think we end up doing a ton
>>> of size-based validation which might avoid the problem.
>>>
>>> My concern is that now constbufs will get the same treatment, and for
>>> constbufs the alignment is always 16 :(
>>>
>>> What do you think? Just drop those, or add extra conditionals to allow
>>> it for constbufs?
>>
>> I'm not sure we've the alignment guarantee for constbufs, IIRC we lower
>> const buf accesses to be indirect because we want to provide more then 8
>> UBO-s,
>> right ? So we read the offset from NVC0_CB_AUX_BUF_INFO and then end up
>> with e.g.:
>
> Right. This is because the launch descriptor used for compute shaders on kepler only allows to set up 8 CBs. But OpenGL requires at least 14 UBOs, so the logic is to stick UBOs' information into the driver constant buffer.
>
> As you can, we do this dance for all UBOs because it's simpler that testing if an UBO has been described in the launch descriptor or not (so if it's mapped as c1[], c2[], etc).
>
> The lowering pass should properly handle indirect UBO accesses (I did write a piglit test for that and looked at blob). But I'm not sure if we can break alignment here.
>
> Do you have a simple shader that might hit the issue?
I'm definitely hitting the issue with opencl programs,
specifically with:
piglit/tests/cl/program/execute/get-num-groups.cl
Which contains:
kernel void fill3d(global int* out) {
unsigned int id = get_global_id(0) + get_global_size(0)*get_global_id(1
out[3*id] = get_num_groups(0);
out[3*id+1] = get_num_groups(1);
out[3*id+2] = get_num_groups(2);
}
Notice the 3 * id, we end up combining
get_num_groups(0) and get_num_groups(1)
into a single 64 bit store, which for
(id % 2 == 1) results in an unaligned trap
on the gpu.
Interestingly enough this is the only piglet cl
test which triggers this, but still this is a real
problem AFAICT.
Note this gets translated into:
COMP
DCL SV[0], BLOCK_ID
DCL SV[1], BLOCK_SIZE
DCL SV[2], GRID_SIZE
DCL SV[3], THREAD_ID
DCL MEMORY[0], GLOBAL
DCL MEMORY[1], SHARED
DCL MEMORY[2], PRIVATE
DCL MEMORY[3], INPUT
IMM[0] UINT32 {2, 0, 0, 0}
IMM[1] UINT32 {0, 0, 0, 0}
0: BGNSUB :0
1: UMUL TEMP[1].x, SV[1].xxxx, SV[0].xxxx
2: UADD TEMP[1].x, SV[3].xxxx, TEMP[1].xxxx
3: SHL TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
4: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
5: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
6: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
7: RET
8: ENDSUB
IMM[2] UINT32 {3, 0, 0, 0}
IMM[3] UINT32 {4, 0, 0, 0}
9: BGNSUB :0
10: UMUL TEMP[1].x, SV[1].yyyy, SV[0].yyyy
11: UADD TEMP[1].x, SV[3].yyyy, TEMP[1].xxxx
12: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
13: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
14: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
15: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
16: SHL TEMP[1].x, TEMP[1].xxxx, IMM[2].xxxx
17: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
18: UADD TEMP[1].z, TEMP[1].yyyy, TEMP[1].xxxx
19: STORE MEMORY[0].x, TEMP[1].zzzz, SV[2].xxxx
20: OR TEMP[1].x, TEMP[1].xxxx, IMM[3].xxxx
21: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
22: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].yyyy
23: RET
24: ENDSUB
IMM[4] UINT32 {12, 0, 0, 0}
IMM[5] UINT32 {8, 0, 0, 0}
25: BGNSUB :0
26: UMUL TEMP[1].x, SV[1].zzzz, SV[0].zzzz
27: UADD TEMP[1].x, SV[3].zzzz, TEMP[1].xxxx
28: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].yyyy
29: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].yyyy
30: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].yyyy
31: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].yyyy
32: UMUL TEMP[1].x, TEMP[1].xxxx, SV[2].xxxx
33: UADD TEMP[1].x, TEMP[1].xxxx, SV[0].xxxx
34: UMUL TEMP[1].x, TEMP[1].xxxx, SV[1].xxxx
35: UADD TEMP[1].x, TEMP[1].xxxx, SV[3].xxxx
36: UMUL TEMP[1].x, TEMP[1].xxxx, IMM[4].xxxx
37: LOAD TEMP[1].y, MEMORY[3].xxxx, IMM[1]
38: UADD TEMP[1].x, TEMP[1].yyyy, TEMP[1].xxxx
39: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].xxxx
40: UADD TEMP[1].y, TEMP[1].xxxx, IMM[3].xxxx
41: STORE MEMORY[0].x, TEMP[1].yyyy, SV[2].yyyy
42: UADD TEMP[1].x, TEMP[1].xxxx, IMM[5].xxxx
43: STORE MEMORY[0].x, TEMP[1].xxxx, SV[2].zzzz
44: RET
45: ENDSUB
With the SUB beginning at 25: corresponding to
the troublesome fill3d function.
With my fix the generated code for this is:
0: rdsv u32 $r0 sv[CTAID:2] (8)
1: rdsv u32 $r1 sv[TID:2] (8)
2: mad u32 $r2 $r0 c7[0xe8] $r1 (8)
3: ld u64 $r0d c7[0xf0] (8)
4: rdsv u32 $r3 sv[CTAID:1] (8)
5: mad u32 $r2 $r2 c7[0xf0] $r3 (8)
6: rdsv u32 $r3 sv[TID:1] (8)
7: mad u32 $r2 $r2 c7[0xe4] $r3 (8)
8: ld u32 $r3 c7[0xec] (8)
9: rdsv u32 $r4 sv[CTAID:0] (8)
10: mad u32 $r2 $r2 c7[0xec] $r4 (8)
11: rdsv u32 $r4 sv[TID:0] (8)
12: mad u32 $r2 $r2 c7[0xe0] $r4 (8)
13: mov u32 $r4 0x0000000c (8)
14: mad u32 $r2 $r2 $r4 c0[0x0] (8)
15: st u32 # g[$r2+0x0] $r3 (8)
16: add u32 $r3 $r2 0x00000004 (8)
17: st u32 # g[$r2+0x4] $r0 (8)
18: add u32 $r0 $r2 0x00000008 (8)
19: st u32 # g[$r2+0x8] $r1 (8)
20: ret (8)
Notice that this code also seems to hit
another bug, instructions 16 and 18
got folded into the "st" instructions as offset,
but they did not get deleted.
Any clues for where to start looking at
the root cause of that are welcome.
###
As for Ilia's solution to not disallow
combining of indirect loads for constbufs
given the discussion that seems sensible,
at least for ubo-s, for opencl the
input parameters may end up being indirectly
accessed in an unaligned matter too.
Regards,
Hans
>
>>
>> ld u64 r2d c7[r1+0x0]
>>
>> Where r1 contains the offset of the user-buf. But what if the user is
>> somehow
>> indirectly accessing the userbuf, then we will have added that indirect
>> offset
>> to r1, and we can no longer assume that we can safely merge the loads
>> without
>> breaking alignment rules.
>>
>> I hope I'm making sense here, I'm still a bit unsure about the details how
>> this all works.
>>
>> Regards,
>>
>> Hans
>
More information about the mesa-dev
mailing list