[Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

Francisco Jerez currojerez at riseup.net
Sat Apr 16 21:01:17 UTC 2016


Jan Vesely <jan.vesely at rutgers.edu> writes:

> On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote:
>> On 15.04.2016 17:12, Francisco Jerez wrote:
>> > > > > > For a test doing almost the same thing but not relying on
>> > > > > > unspecified invocation ordering, see
>> > > > > > "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c"
>> > > > > > -- It would be interesting to see whether you can get it to
>> > > > > > reproduce the GCN coherency bug using different framebuffer
>> > > > > > size and modulus parameters.
>> > > > > I tried that, but couldn't reproduce. Whether I just wasn't
>> > > > > thorough enough/"unlucky" or whether the in-order nature of the
>> > > > > hardware and L1 cache behavior just makes it impossible to fail
>> > > > > the shader-mem-barrier test, I'm not sure.
>> > > > > 
>> > > > Now I'm curious about the exact nature of the bug ;), some sort
>> > > > of missing L1 cache-flushing which could potentially affect
>> > > > dependent invocations?
>> > > I'm not sure I remember everything, to be honest.
>> > > 
>> > > One issue that I do remember is that load/store by default go
>> > > through L1, but atomics _never_ go through L1, no matter how you
>> > > compile them. This means that if you're working on two different
>> > > images, one with atomics and the other without, then the atomic
>> > > one will always behave coherently but the other one won't unless
>> > > you explicitly tell it to.
>> > > 
>> > > Now that I think about this again, there should probably be a
>> > > shader-mem-barrier-style way to test for that particular issue in
>> > > a way that doesn't depend on the specifics of the parallelization.
>> > > Something like, in a loop:
>> > > 
>> > > Thread 1: increasing imageStore into image 1 at location 1,
>> > > imageLoad from image 1 location 2
>> > > 
>> > > Thread 2: same, but exchange locations 1 and 2
>> > > 
>> > > Both threads: imageAtomicAdd on the same location in image 2
>> > > 
>> > > Then each thread can check that _if_ the imageAtomicAdd detects
>> > > the buddy thread operating in parallel, _then_ they must also
>> > > observe incrementing values in the location that the buddy thread
>> > > stores to. Does that sound reasonable?
>> > > 
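(For illustration, a minimal GLSL sketch of the loop described above.  The
image names, the pairing uniforms and the iteration count are hypothetical
and not taken from any existing piglit test, and the explicit
memoryBarrier() discussed in the reply below is left out for now:)

#version 150
#extension GL_ARB_shader_image_load_store : enable

layout(r32ui) coherent uniform uimage2D data_img;  /* plain loads/stores */
layout(r32ui) coherent uniform uimage2D sync_img;  /* imageAtomicAdd only */

uniform uint iterations;            /* hypothetical loop bound */
uniform ivec2 own_loc, buddy_loc;   /* swapped between the two threads */
uniform ivec2 sync_loc;             /* shared by both threads */

out vec4 color;

void main()
{
        bool ok = true;

        for (uint i = 0u; i < iterations; i++) {
                /* Publish our own progress through the non-atomic image. */
                imageStore(data_img, own_loc, uvec4(i + 1u));

                /* Count both threads' iterations in the atomic image. */
                uint ticket = imageAtomicAdd(sync_img, sync_loc, 1u);

                /* We have performed i adds so far, so the rest must be
                 * the buddy's. */
                uint buddy_adds = ticket - i;

                /* If the atomic counter shows the buddy has completed n
                 * iterations, its n-th store should be visible too. */
                if (imageLoad(data_img, buddy_loc).x < buddy_adds)
                        ok = false;
        }

        color = ok ? vec4(0, 1, 0, 1) : vec4(1, 0, 0, 1);
}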
>> > Yeah, that sounds reasonable, but keep in mind that even if both
>> > image variables are marked coherent you cannot make assumptions
>> > about the ordering of the image stores performed on image 1 relative
>> > to the atomics performed on image 2 unless there is an explicit
>> > barrier in between, which means that some level of L1 caching is
>> > legitimate even in that scenario (and might have some performance
>> > benefit over skipping L1 caching of coherent images altogether) --
>> > That's in fact the way that the i965 driver implements coherent
>> > image stores: We just write to L1 and flush later on to the globally
>> > coherent L3 on the next memoryBarrier().
>> Okay, adding the barrier makes sense.
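(Continuing the hypothetical sketch above: the ordering Francisco describes
would be enforced by an explicit barrier between the store and the atomic in
each loop iteration, e.g.:)

        imageStore(data_img, own_loc, uvec4(i + 1u));

        /* Make the store above globally visible (e.g. flush it out of L1)
         * before the atomic that the buddy uses to detect our progress. */
        memoryBarrier();

        uint ticket = imageAtomicAdd(sync_img, sync_loc, 1u);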
>> 
>> 
>> > 
>> > What about a test along the lines of the current coherency test?
>> > Any idea why you couldn't get it to reproduce the issue?  Is it
>> > because threads with dependent inputs are guaranteed to be spawned
>> > in the same L1 cache domain as the threads that generated their
>> > inputs, or something like that?
>> From what I understand (though admittedly the documentation I have
>> on this is not the clearest...), the hardware flushes the L1 cache
>> automatically at the end of each shader invocation, so that dependent
>> invocations are guaranteed to pick it up.
>
> The GCN whitepaper mentions both that the L1D cache is write-through
> and sends data to L2 at the end of all 64 WF stores (page 9), and that
> the L1 cache writes back data at the end of a wavefront or when a
> barrier is invoked (page 10).
>

Interesting...  In that case I wonder whether there actually was a bug to
test for, or whether you could just call the non-coherent behavior of L1
writes a feature? ;)

Nicolai, do I have your Acked-by for the revert?

> Jan
>
>> 
>> Cheers,
>> Nicolai

