[Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

Francisco Jerez currojerez at riseup.net
Sat Apr 16 22:33:36 UTC 2016

Nicolai Hähnle <nicolai.haehnle at amd.com> writes:

> On 16.04.2016 16:01, Francisco Jerez wrote:
>> Jan Vesely <jan.vesely at rutgers.edu> writes:
>>> On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote:
>>>> On 15.04.2016 17:12, Francisco Jerez wrote:
>>>>>>>>> For a test doing almost the same thing but not relying on
>>>>>>>>> unspecified
>>>>>>>>> invocation ordering, see
>>>>>>>>> "tests/spec/arb_shader_image_load_store/shader-mem-
>>>>>>>>> barrier.c" -- It
>>>>>>>>> would be interesting to see whether you can get it to
>>>>>>>>> reproduce the GCN
>>>>>>>>> coherency bug using different framebuffer size and modulus
>>>>>>>>> parameters.
>>>>>>>> I tried that, but couldn't reproduce. Whether I just wasn't
>>>>>>>> thorough
>>>>>>>> enough/"unlucky" or whether the in-order nature of the
>>>>>>>> hardware and L1
>>>>>>>> cache behavior just makes it impossible to fail the shader-
>>>>>>>> mem-barrier
>>>>>>>> test, I'm not sure.
>>>>>>> Now I'm curious about the exact nature of the bug ;), some sort
>>>>>>> of
>>>>>>> missing L1 cache-flushing which could potentially affect
>>>>>>> dependent
>>>>>>> invocations?
>>>>>> I'm not sure I remember everything, to be honest.
>>>>>> One issue that I do remember is that load/store by default go
>>>>>> through
>>>>>> L1, but atomics _never_ go through L1, no matter how you compile
>>>>>> them.
>>>>>> This means that if you're working on two different images, one
>>>>>> with
>>>>>> atomics and the other without, then the atomic one will always
>>>>>> behave
>>>>>> coherently but the other one won't unless you explicitly tell it
>>>>>> to.
>>>>>> Now that I think about this again, there should probably be a
>>>>>> shader-mem-barrier-style way to test for that particular issue in
>>>>>> a way
>>>>>> that doesn't depend on the specifics of the parallelization.
>>>>>> Something
>>>>>> like, in a loop:
>>>>>> Thread 1: increasing imageStore into image 1 at location 1,
>>>>>> imageLoad
>>>>>> from image 1 location 2
>>>>>> Thread 2: same, but exchange locations 1 and 2
>>>>>> Both threads: imageAtomicAdd on the same location in image 2
>>>>>> Then each thread can check that _if_ the imageAtomicAdd detects
>>>>>> the
>>>>>> buddy thread operating in parallel, _then_ they must also observe
>>>>>> incrementing values in the location that the buddy thread stores
>>>>>> to.
>>>>>> Does that sound reasonable?
>>>>> Yeah, that sounds reasonable, but keep in mind that even if both
>>>>> image
>>>>> variables are marked coherent you cannot make assumptions about the
>>>>> ordering of the image stores performed on image 1 relative to the
>>>>> atomics performed on image 2 unless there is an explicit barrier in
>>>>> between, which means that some level of L1 caching is legitimate
>>>>> even in
>>>>> that scenario (and might have some performance benefit over
>>>>> skipping L1
>>>>> caching of coherent images altogether) -- That's in fact the way
>>>>> that
>>>>> the i965 driver implements coherent image stores: We just write to
>>>>> L1
>>>>> and flush later on to the globally coherent L3 on the next
>>>>> memoryBarrier().
>>>> Okay, adding the barrier makes sense.
>>>>> What about a test along the lines of the current coherency
>>>>> test?  Any
>>>>> idea what's the reason you couldn't get it to reproduce the
>>>>> issue?  Is
>>>>> it because threads with dependent inputs are guaranteed to be
>>>>> spawned in
>>>>> the same L1 cache domain as the threads that generated their inputs
>>>>> or
>>>>> something like that?
>>>>   From what I understand (though admittedly the documentation I have
>>>> on
>>>> this is not the clearest...), the hardware flushes the L1 cache
>>>> automatically at the end of each shader invocation, so that
>>>> dependent
>>>> invocations are guaranteed to pick it up.
>>> The GCN whitepaper mentions both that the L1D cache is write-through
>>> and sends data to L2 at the end all 64 WF stores (page 9), and that the
>>> L1 cache writes back data at the end of wavefront or when barrier is
>>> invoked (page 10).
>> Interesting...  In that case I wonder if there actually was a bug to
>> test for or you could just call the non-coherent behavior of L1 writes a
>> feature? ;)
>> Nicolai, do I have your Acked-by for the revert?
> Yes, I should have made that clearer in my first reply :)
Thanks, pushed.

> Anyway, there _is_ still something that needs to be tested. Whereas 
> coherency only worries about stores being propagated to dependent 
> invocations, and shader-mem-barrier only worries about stores being 
> propagated _in the correct order_, there needs to be a test that worries 
> about stores being propagated _at all_ (because of the mentioned 
> difference in GCN between atomics and load/stores). But I'll follow up 
> with such a test soon.

The problem is how do you define "at all" in a way independent from the
unspecified shader program and memory transaction execution order?  At
all as in before the originating shader invocation completes?  An
implicit L1 flush at the end of the program would certainly guarantee
that.  At all as in not after any strictly ordered memory transaction?
It sounds like your memory barrier implementation may already make sure
of that.  At all as in before another shader invocation executes up to
some specified point? -- Then you may be out of luck if the shader
invocations are independent from each other and their execution order is

> Nicolai
>>> Jan
>>>> Cheers,
>>>> Nicolai
>>>> _______________________________________________
>>>> Piglit mailing list
>>>> Piglit at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/piglit
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/piglit/attachments/20160416/ad80bdbd/attachment.sig>

More information about the Piglit mailing list