[Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

Sat Apr 16 20:06:46 UTC 2016

Nicolai Hähnle <nhaehnle at gmail.com> writes:

> On 15.04.2016 17:12, Francisco Jerez wrote:
>>>>>> For a test doing almost the same thing but not relying on unspecified
>>>>>> invocation ordering, see
>>>>>> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It
>>>>>> would be interesting to see whether you can get it to reproduce the GCN
>>>>>> coherency bug using different framebuffer size and modulus parameters.
>>>>>
>>>>> I tried that, but couldn't reproduce. Whether I just wasn't thorough
>>>>> enough/"unlucky" or whether the in-order nature of the hardware and L1
>>>>> cache behavior just makes it impossible to fail the shader-mem-barrier
>>>>> test, I'm not sure.
>>>>>
>>>> Now I'm curious about the exact nature of the bug ;), some sort of
>>>> missing L1 cache-flushing which could potentially affect dependent
>>>> invocations?
>>>
>>> I'm not sure I remember everything, to be honest.
>>>
>>> One issue that I do remember is that load/store by default go through
>>> L1, but atomics _never_ go through L1, no matter how you compile them.
>>> This means that if you're working on two different images, one with
>>> atomics and the other without, then the atomic one will always behave
>>> coherently but the other one won't unless you explicitly tell it to.
>>>
>>> Now that I think about this again, there should probably be a
>>> shader-mem-barrier-style way to test for that particular issue in a way
>>> that doesn't depend on the specifics of the parallelization. Something
>>> like, in a loop:
>>>
>>> Thread 1: increasing imageStore into image 1 at location 1, imageLoad
>>> from image 1 location 2
>>>
>>> Thread 2: same, but exchange locations 1 and 2
>>>
>>> Both threads: imageAtomicAdd on the same location in image 2
>>>
>>> Then each thread can check that _if_ the imageAtomicAdd detects the
>>> buddy thread operating in parallel, _then_ they must also observe
>>> incrementing values in the location that the buddy thread stores to.
>>> Does that sound reasonable?
>>>
>> Yeah, that sounds reasonable, but keep in mind that even if both image
>> variables are marked coherent you cannot make assumptions about the
>> ordering of the image stores performed on image 1 relative to the
>> atomics performed on image 2 unless there is an explicit barrier in
>> between, which means that some level of L1 caching is legitimate even in
>> that scenario (and might have some performance benefit over skipping L1
>> caching of coherent images altogether) -- That's in fact the way that
>> the i965 driver implements coherent image stores: We just write to L1
>> and flush later on to the globally coherent L3 on the next
>> memoryBarrier().
>
> Okay, adding the barrier makes sense.
>
>
>> What about a test along the lines of the current coherency test?  Any
>> idea what's the reason you couldn't get it to reproduce the issue?  Is
>> it because threads with dependent inputs are guaranteed to be spawned in
>> the same L1 cache domain as the threads that generated their inputs or
>> something like that?
>
>  From what I understand (though admittedly the documentation I have on 
> this is not the clearest...), the hardware flushes the L1 cache 
> automatically at the end of each shader invocation, so that dependent 
> invocations are guaranteed to pick it up.
>
Ah, interesting.  What about memoryBarrier()?  Does that cause the
back-end compiler to emit an L1 cache flush of some sort?

> Cheers,
> Nicolai
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/piglit/attachments/20160416/790fe2ff/attachment.sig>