[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

Fri Dec 21 08:25:54 UTC 2018

When two rings time out at the same time, each triggers job_timedout separately.
Each job_timedout calls gpu_recover, but one gpu_recover blocks on the mutex_lock held by the other.
The bad job's callback should be removed by dma_fence_remove_callback, but that call sits behind the mutex_lock,
so dma_fence_remove_callback cannot be called immediately.
The callback drm_sched_process_job then fires unexpectedly and sets DMA_FENCE_FLAG_SIGNALED_BIT.
After the other recovery's mutex_unlock, the already signaled bad job goes through job_run inside drm_sched_job_recovery.
job_run then hits a WARN_ON with a call trace when it calls kcl_dma_fence_set_error for the signaled bad job.

Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
Signed-off-by: Wentao Lou <Wentao.Lou at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
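
Note for reviewers: the sequence described in the commit message can be
modelled in user space. The sketch below is only an illustration under
stated assumptions. Every model_* name is hypothetical and stands in for a
kernel object; none of it is the real dma_fence or scheduler code. It
assumes that setting an error on an already signaled fence warns, as the
commit message describes, and counts those warnings instead of printing a
call trace.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for DMA_FENCE_FLAG_SIGNALED_BIT. */
enum { MODEL_FLAG_SIGNALED_BIT = 0 };

/* Hypothetical stand-in for the job's "finished" fence. */
struct model_fence {
	unsigned long flags;
	int error;
	int warnings;	/* counts what would be WARN_ON hits in the kernel */
};

/* Non-atomic model of test_and_clear_bit(): returns the old bit value. */
static bool model_test_and_clear_bit(int bit, unsigned long *flags)
{
	bool was_set = (*flags >> bit) & 1UL;

	*flags &= ~(1UL << bit);
	return was_set;
}

/* Models the rule that an error may only be set before signaling. */
static void model_fence_set_error(struct model_fence *f, int error)
{
	if (f->flags & (1UL << MODEL_FLAG_SIGNALED_BIT))
		f->warnings++;
	f->error = error;
}

/* Models the stale callback that fires before it could be removed. */
static void model_stale_callback(struct model_fence *f)
{
	f->flags |= 1UL << MODEL_FLAG_SIGNALED_BIT;
}

/*
 * Models the VRAM-lost branch of job_run. With clear_first set, the
 * signaled bit is cleared before the error is recorded, as in this patch.
 * Returns the number of warnings seen so far on the fence.
 */
static int model_job_run(struct model_fence *finished, bool vram_lost,
			 bool clear_first)
{
	if (vram_lost) {
		if (clear_first)
			model_test_and_clear_bit(MODEL_FLAG_SIGNALED_BIT,
						 &finished->flags);
		model_fence_set_error(finished, -ECANCELED);
	}
	return finished->warnings;
}
```

In this model, calling model_stale_callback() on a zeroed fence and then
model_job_run() with clear_first == false records one warning, while the
same sequence with clear_first == true records none and still leaves
error set to -ECANCELED.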

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0a17fb1..fc1d3a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 
 	trace_amdgpu_sched_run_job(job);
 
-	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
+	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
+		/* the signaled bit may have been set by an unexpected callback; clear it */
+		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
 		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
+	}
 
 	if (finished->error < 0) {
 		DRM_INFO("Skip scheduling IBs!\n");
-- 
2.7.4