[Piglit] [Bug] Fragment shader interlock sample shading test assumes wrong guarantees (plain text copy)

Fri Jun 23 21:42:01 UTC 2023

Oops, sorry for HTML in the previous message breaking archiving. 
Resending in a different client as plain text.

Hi!

I'm currently preparing the implementation of fragment shader interlock 
in the Mesa RADV driver for merging (merge request !22250), and one 
Piglit test fails on Zink on RADV with my implementation.

The test is spec/arb_fragment_shader_interlock/image-load-store.

I noticed that it used layout(pixel_interlock_ordered) with sample 
shading. This is a combination that is only available in OpenGL and 
Vulkan, but not in Direct3D, Metal and GL_INTEL_fragment_shader_ordering 
(which force sample interlock if sample shading is used), and given that 
AMD hasn't exposed the KHR/EXT fragment shader interlock in their 
official OpenGL and Vulkan drivers so far (unlike in Direct3D and 
Metal), I was starting to think that they were unable to support the 
full requirements of pixel_interlock at all.

However, they actually do support pixel_interlock with sample shading. I 
was able to verify that by modifying an existing example 
(`Triang3l/vk_order_independent_transparency` on GitHub, 
`radv_sample_shading_test` branch) of fragment shader interlock usage 
with per-pixel data to use MSAA and sample shading, but only accessing 
that data once per pixel for each primitive, by wrapping all pixel-local 
data accesses in:

`if (gl_SampleID == findLSB(gl_SampleMaskIn[0])) { /* ... */ }`

so that for each fragment belonging to one primitive, the pixel-local 
data is accessed from just one sample-rate invocation. With my change, 
as expected with pixel_interlock_ordered, at common edges of adjacent 
primitives, which cover the same pixels, but don't have common covered 
samples, I got pixel values that were stable between frames, plus a huge 
speed loss. And when I changed the mode to sample_interlock_ordered (in 
AMD hardware terms, setting POPS_OVERLAP_NUM_SAMPLES to 
MSAA_EXPOSED_SAMPLES rather than 1x), the edges became unstable and 
noisy, and everything ran much faster. So, AMD GPUs actually support 
pixel_interlock, and if you have POPS_OVERLAP_NUM_SAMPLES = 1x, you 
still get interlock between primitives with no common samples as 
pixel_interlock requires, unlike in Direct3D, even if sample shading is 
used.

So it turns out that the Piglit test itself apparently requires the 
guarantees that neither the OpenGL (explicitly) nor the Vulkan 
specification provides, and that AMD hardware rightfully doesn't 
implement, though Intel (who created that Piglit test) actually happens 
to offer them due to the details of how their GPUs seem to work.

Specifically, the following two parts are problematic:

ivec3 current_sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, 
gl_SampleID);
// ...
imageStore(img_output, current_sample_coord, result);

and:

for (i = 0; i < sample_rate; i++) {
   if (i != gl_SampleID) {
     ivec3 sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, i);
     vec4 sample_color = imageLoad(img_output, sample_coord);
     // ...
   }
}

In this test, each invocation (corresponding to one MSAA sample) writes 
to a location for its sample, and reads the data for all other samples 
in the pixel.

However, most of the time (if a primitive covers more than 1 sample of a 
pixel, thus everywhere in the middle of a primitive) this results in 
invocations accessing data (for other samples in the pixel) written by 
invocations belonging to other samples in __the same primitive__.

According to the specification of GL_ARB_fragment_shader_interlock, no 
synchronization is guaranteed in this case:

     (1) When using multisampling, the OpenGL specification permits
         multiple fragment shader invocations to be generated for a single
         fragment.  For example, per-sample shading using the "sample"
         auxiliary storage qualifier or the MinSampleShading() OpenGL 
API command
         can be used to force per-sample shading.  What execution ordering
         guarantees are provided between fragment shader invocations 
generated
         from the same fragment?

       RESOLVED:  We don't provide any ordering guarantees in this 
extension.
       This implies that when using multisampling, there is no guarantee 
that
       two fragment shader invocations for the same fragment won't be 
executing
       their critical sections concurrently.  This could cause problems for
       algorithms sharing data structures between all the samples of a pixel
       unless accesses to these data structures are performed atomically.

       When using per-sample shading, the interlock we provide *does* 
guarantee
       that no two invocations corresponding to the same sample execute the
       critical section concurrently.  If a separate set of data 
structures is
       provided for each sample, no conflicts should occur within the 
critical
       section.

As the last paragraph of the quote says, "the interlock we provide does 
guarantee that no two invocations corresponding to __the same sample__ 
execute the critical section concurrently". But in this test, the same 
memory location is accessed by invocations corresponding to __different 
samples__. And thus, the regular rule for invocations generated from the 
same fragment applies: "there is no guarantee that two fragment shader 
invocations for the same fragment won't be executing their critical 
sections concurrently".

In Vulkan, although its specification does not apply to OpenGL but still 
can provide additional background information, fragment shader interlock 
synchronization scope is defined in terms of rasterization order:

 > If the PixelInterlockOrderedEXT execution mode is specified, any 
interlocked operations in a fragment shader must happen before 
interlocked operations in fragment shader invocations that execute later 
in rasterization order and cover at least one sample in the same pixel, 
and must happen after interlocked operations in a fragment shader that 
executes earlier in rasterization order and cover at least one sample in 
the same pixel.

 > If the PixelInterlockUnorderedEXT execution mode is specified, any 
interlocked operations in a fragment shader must happen before or after 
interlocked operations in fragment shader invocations that execute 
earlier or later in rasterization order and cover at least one sample in 
the same pixel.

However, no ordering guarantees are specified for the case when shader 
invocations execute __neither earlier nor later__ than each other in 
rasterization order.

It appears that Intel actually provides the guarantee for invocations 
for the same fragment, but it's not something that OpenGL requires, and 
not all vendors are implementing, with AMD being one of those who don't. 
And this issue apparently hasn't manifested itself until now because 
currently Intel is the only vendor with fragment shader interlock 
implemented in the Mesa upstream.

The Vulkan CTS contains some tests that seem to include the combination 
of pixel_interlock and sample shading in 
VK-GL-CTS/external/vulkancts/modules/vulkan/fragment_shader_interlock, 
and both Intel and AMD GPUs pass them successfully. I think the current 
broken test may be replaced with a port of those tests, that cover many 
additional cases like `discard` in various locations relative to the 
critical section.

Thank you!
-- Triang3l