GART write flush error on SI w/ amdgpu

Marek Olšák maraeo at
Fri May 26 15:47:36 UTC 2017

On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaehnle at> wrote:
> Hi all,
> I'm seeing some very strange errors on Verde with CPU readback from GART,
> and am pretty much out of ideas. Some help would be very much appreciated.
> The error manifests with the
> GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
> *not* on radeon. Here's what the test does:
> 1. Upload a texture.
> 2. Read the texture back via a shader that uses shader buffer writes to
> write data to a buffer that is allocated in GART.
> 3. The CPU then reads from the buffer -- and sometimes gets stale data.
> This sequence is repeated for many sub-tests. There are some sub-tests
> the CPU reads stale data from the buffer, i.e. the shader writes simply
> don't make it to the CPU. The tests vary superficially, e.g. the first
> failing test is (almost?) always one where data is written in 16-bit words
> (but there are succeeding sub-tests with 16-bit writes as well).
> The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
> between the fence wait and the return of glMapBuffer does not fix the
> problem. The data must be stuck in a cache somewhere.
> Since the test runs okay with the radeon module, I tried some changes
> on comparing the IB submit between radeon and amdgpu, and based on
> register settings via scans obtained from umr. Some of the things I've
> tried:
> - Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
> set this)
> - Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
> (radeon does this)
> - Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
> (which seems more logical, and is done by gfx7+), or remove the
> corresponding WRITE_DATA entirely
> None of these changes helped.
> What *does* help is adding an artificial wait. Specifically, I'm adding a
> sequence of
> as can be seen in the attached patch. This works around the problem, but
> makes no sense:
> Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
> around the problem. However(!) it does not actually cause the UMD to wait
> any longer than before. Without this change, the UMD immediately sees a
> signaled user fence (and never uses an ioctl to wait), and with this
> it *still* sees a signaled user fence.
> Also, note that the way I've hacked the change, the wait sequence is only
> added for the user fence emit (and I'm using a modified UMD to ensure that
> there is enough memory to be used by the added wait sequence).
> Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
> problem.
> So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
> encourages some part of the GPU to flush the data from wherever it's
> and that's just really bizarre. There must be something really simple I'm
> missing, and any pointers would be appreciated.

Have you tried this?

diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
index 92c09cb..e6ac0ba 100644
--- a/src/gallium/drivers/radeonsi/si_hw_context.c
+++ b/src/gallium/drivers/radeonsi/si_hw_context.c
@@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags,

        /* DRM 3.1.0 doesn't flush TC for VI correctly. */
-       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
+       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
+           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
                ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |

One more cache flush there shouldn't hurt.

Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the amd-gfx mailing list