[Piglit] [Bug] Fragment shader interlock sample shading test assumes wrong guarantees (plain text copy)
Triang3l
triang3l at yandex.ru
Fri Jun 23 21:42:01 UTC 2023
Oops, sorry for HTML in the previous message breaking archiving.
Resending in a different client as plain text.
Hi!
I'm currently preparing the implementation of fragment shader interlock
in the Mesa RADV driver for merging (merge request !22250), and one
Piglit test fails on Zink on RADV with my implementation.
The test is spec/arb_fragment_shader_interlock/image-load-store.
I noticed that it used layout(pixel_interlock_ordered) with sample
shading. This is a combination that is only available in OpenGL and
Vulkan, but not in Direct3D, Metal and GL_INTEL_fragment_shader_ordering
(which force sample interlock if sample shading is used), and given that
AMD hasn't exposed the KHR/EXT fragment shader interlock in their
official OpenGL and Vulkan drivers so far (unlike in Direct3D and
Metal), I was starting to think that they were unable to support the
full requirements of pixel_interlock at all.
However, they actually do support pixel_interlock with sample shading. I
was able to verify that by modifying an existing example
(`Triang3l/vk_order_independent_transparency` on GitHub,
`radv_sample_shading_test` branch) of fragment shader interlock usage
with per-pixel data to use MSAA and sample shading, but only accessing
that data once per pixel for each primitive, by wrapping all pixel-local
data accesses in:
`if (gl_SampleID == findLSB(gl_SampleMaskIn[0])) { /* ... */ }`
so that for each fragment belonging to one primitive, the pixel-local
data is accessed from just one sample-rate invocation. With my change,
as expected with pixel_interlock_ordered, at common edges of adjacent
primitives, which cover the same pixels, but don't have common covered
samples, I got pixel values that were stable between frames, plus a huge
speed loss. And when I changed the mode to sample_interlock_ordered (in
AMD hardware terms, setting POPS_OVERLAP_NUM_SAMPLES to
MSAA_EXPOSED_SAMPLES rather than 1x), the edges became unstable and
noisy, and everything ran much faster. So, AMD GPUs actually support
pixel_interlock, and if you have POPS_OVERLAP_NUM_SAMPLES = 1x, you
still get interlock between primitives with no common samples as
pixel_interlock requires, unlike in Direct3D, even if sample shading is
used.
So it turns out that the Piglit test itself apparently requires the
guarantees that neither the OpenGL (explicitly) nor the Vulkan
specification provides, and that AMD hardware rightfully doesn't
implement, though Intel (who created that Piglit test) actually happens
to offer them due to the details of how their GPUs seem to work.
Specifically, the following two parts are problematic:
ivec3 current_sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y,
gl_SampleID);
// ...
imageStore(img_output, current_sample_coord, result);
and:
for (i = 0; i < sample_rate; i++) {
if (i != gl_SampleID) {
ivec3 sample_coord = ivec3(gl_FragCoord.x, gl_FragCoord.y, i);
vec4 sample_color = imageLoad(img_output, sample_coord);
// ...
}
}
In this test, each invocation (corresponding to one MSAA sample) writes
to a location for its sample, and reads the data for all other samples
in the pixel.
However, most of the time (if a primitive covers more than 1 sample of a
pixel, thus everywhere in the middle of a primitive) this results in
invocations accessing data (for other samples in the pixel) written by
invocations belonging to other samples in __the same primitive__.
According to the specification of GL_ARB_fragment_shader_interlock, no
synchronization is guaranteed in this case:
(1) When using multisampling, the OpenGL specification permits
multiple fragment shader invocations to be generated for a single
fragment. For example, per-sample shading using the "sample"
auxiliary storage qualifier or the MinSampleShading() OpenGL
API command
can be used to force per-sample shading. What execution ordering
guarantees are provided between fragment shader invocations
generated
from the same fragment?
RESOLVED: We don't provide any ordering guarantees in this
extension.
This implies that when using multisampling, there is no guarantee
that
two fragment shader invocations for the same fragment won't be
executing
their critical sections concurrently. This could cause problems for
algorithms sharing data structures between all the samples of a pixel
unless accesses to these data structures are performed atomically.
When using per-sample shading, the interlock we provide *does*
guarantee
that no two invocations corresponding to the same sample execute the
critical section concurrently. If a separate set of data
structures is
provided for each sample, no conflicts should occur within the
critical
section.
As the last paragraph of the quote says, "the interlock we provide does
guarantee that no two invocations corresponding to __the same sample__
execute the critical section concurrently". But in this test, the same
memory location is accessed by invocations corresponding to __different
samples__. And thus, the regular rule for invocations generated from the
same fragment applies: "there is no guarantee that two fragment shader
invocations for the same fragment won't be executing their critical
sections concurrently".
In Vulkan, although its specification does not apply to OpenGL but still
can provide additional background information, fragment shader interlock
synchronization scope is defined in terms of rasterization order:
> If the PixelInterlockOrderedEXT execution mode is specified, any
interlocked operations in a fragment shader must happen before
interlocked operations in fragment shader invocations that execute later
in rasterization order and cover at least one sample in the same pixel,
and must happen after interlocked operations in a fragment shader that
executes earlier in rasterization order and cover at least one sample in
the same pixel.
> If the PixelInterlockUnorderedEXT execution mode is specified, any
interlocked operations in a fragment shader must happen before or after
interlocked operations in fragment shader invocations that execute
earlier or later in rasterization order and cover at least one sample in
the same pixel.
However, no ordering guarantees are specified for the case when shader
invocations execute __neither earlier nor later__ than each other in
rasterization order.
It appears that Intel actually provides the guarantee for invocations
for the same fragment, but it's not something that OpenGL requires, and
not all vendors are implementing, with AMD being one of those who don't.
And this issue apparently hasn't manifested itself until now because
currently Intel is the only vendor with fragment shader interlock
implemented in the Mesa upstream.
The Vulkan CTS contains some tests that seem to include the combination
of pixel_interlock and sample shading in
VK-GL-CTS/external/vulkancts/modules/vulkan/fragment_shader_interlock,
and both Intel and AMD GPUs pass them successfully. I think the current
broken test may be replaced with a port of those tests, that cover many
additional cases like `discard` in various locations relative to the
critical section.
Thank you!
-- Triang3l
More information about the Piglit
mailing list