[Mesa-dev] [PATCH] [rfc] r600: set vpm bit for loop start clause

Fri Aug 3 14:26:15 UTC 2018

Am 03.08.2018 um 07:00 schrieb Dave Airlie:
> On 3 August 2018 at 14:02, Roland Scheidegger <sroland at vmware.com> wrote:
>> Am 03.08.2018 um 05:10 schrieb Dave Airlie:
>>> From: Dave Airlie <airlied at redhat.com>
>>>
>>> This fixes some hangs with the arb_shader_image_load_store-atomicity tests
>>> on evergreen/cayman GPUs.
>>>
>>> I'm not 100% sure why (VPM hurts my brain), I'm running some piglit
>>> runs to see if it has any bad side effects.
>>> ---
>>>  src/gallium/drivers/r600/r600_shader.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/src/gallium/drivers/r600/r600_shader.c b/src/gallium/drivers/r600/r600_shader.c
>>> index 2229dc8fab3..1341bf2c48c 100644
>>> --- a/src/gallium/drivers/r600/r600_shader.c
>>> +++ b/src/gallium/drivers/r600/r600_shader.c
>>> @@ -10587,7 +10587,7 @@ static int tgsi_bgnloop(struct r600_shader_ctx *ctx)
>>>       /* LOOP_START_DX10 ignores the LOOP_CONFIG* registers, so it is not
>>>        * limited to 4096 iterations, like the other LOOP_* instructions. */
>>>       r600_bytecode_add_cfinst(ctx->bc, CF_OP_LOOP_START_DX10);
>>> -
>>> +     ctx->bc->cf_last->vpm = 1;
>>>       fc_pushlevel(ctx, FC_LOOP);
>>>
>>>       /* check stack depth */
>>>
>>
>> I think you might get incorrect derivatives afterwards (if some pixels
>> in a quad were invalid)? Or are the lanes in the alu still active (I
>> don't really understand all the vpm stuff...) afterwards?
> 
> That I'm not sure about, I think you end up pushing a different mask
> onto the stack for the subsequent alu operations. But I've never fully
> wrapped my brain around this and retained it.
> 
>>
>> I'm wondering if the tests are just bad? You can get lockups rather
>> easily if you're not extra careful with atomics in loops (because the
>> result of the atomic is undefined for helper invocations, and the hw
>> will never store the result).
>> There was recently some discussion about this on mesa-dev, also see
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.freedesktop.org%2Fshow_bug.cgi%3Fid%3D106902&data=02%7C01%7Csroland%40vmware.com%7C7171e1fa077247d03cb708d5f8fdfa31%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636688692072096391&sdata=La3khLxm6DWSXwcmTW7SVDm3vQntxA942HQigZQ1NRA%3D&reserved=0
> 
> I expect it's in this category, we get garbage back from the helper
> invocation atomics and the loop loops forever. Setting vpm
> must avoid the complete loop for helper invocations.
> 
> Maybe I can limit the vpm setting to only cases where we have
> atomic operations in the loop to avoid any unexpected fallout.
If the mask for the alu still includes the helper invocations, I'd think
it should be fine (because if the loop mask is really different for
helper and non-helper invocations, it seems likely any derivatives would
be garbage anyway afterwards, since the control flow was previously
different, regardless if the loop condition was actually evaluated for
helper invocations or not).

> 
> Well at the moment fglrx passes the test fine and doesn't hang, it's
> where I got the idea for the vpm here. Now fglrx might just be avoiding
> the problem or it may have a good reason for setting vpm on the for loop,
> but I can't really tell.
> 

Yeah I think getting VPM / WQM correct is tricky...
I do believe though you can actually get hangs with such shaders with
blobs too.

Roland

> 
> Dave.
>