[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback
wentalou
Wentao.Lou at amd.com
Fri Dec 21 08:25:54 UTC 2018
When two rings time out at the same time, each triggers job_timedout separately.
Each job_timedout calls gpu_recover, but one gpu_recover is blocked on the mutex_lock held by the other.
The bad job's callback should be removed by dma_fence_remove_callback, but that call sits behind the mutex_lock.
So dma_fence_remove_callback cannot be called immediately.
The callback drm_sched_process_job then fires unexpectedly and sets DMA_FENCE_FLAG_SIGNALED_BIT.
After the other's mutex_unlock, the already-signaled bad job goes through job_run inside drm_sched_job_recovery.
job_run then hits a WARN_ON with a call trace when it calls kcl_dma_fence_set_error on the signaled bad job.
Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
Signed-off-by: Wentao Lou <Wentao.Lou at amd.com>
---
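Note for reviewers: the sequence described above can be sketched in user space as below. This is only a simplified model, not kernel code. struct model_fence and every model_* function are hypothetical stand-ins invented for illustration, and the sketch assumes that setting an error on an already-signaled fence is what raises the warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dma_fence; not the kernel type. */
struct model_fence {
	bool signaled;	/* models DMA_FENCE_FLAG_SIGNALED_BIT */
	int error;	/* models fence->error */
	int warnings;	/* counts would-be WARN_ON hits */
};

/* Models the stale scheduler callback marking the finished fence signaled. */
static void model_stale_callback(struct model_fence *f)
{
	f->signaled = true;
}

/* Models setting a fence error: assumed to warn if already signaled. */
static void model_set_error(struct model_fence *f, int error)
{
	if (f->signaled)
		f->warnings++;
	f->error = error;
}

/*
 * Models the VRAM-lost branch of job_run. With clear_first set, the
 * signaled state is cleared before the error is set, as in this patch.
 * Returns the number of warnings recorded on the fence so far.
 */
static int model_job_run(struct model_fence *f, bool vram_lost, bool clear_first)
{
	if (vram_lost) {
		if (clear_first)
			f->signaled = false;
		model_set_error(f, -ECANCELED);
	}
	return f->warnings;
}

/* Runs both scenarios: without the clear (warns) and with it (silent). */
static void model_demo(void)
{
	struct model_fence before = { 0 };
	struct model_fence after = { 0 };

	model_stale_callback(&before);
	assert(model_job_run(&before, true, false) == 1);

	model_stale_callback(&after);
	assert(model_job_run(&after, true, true) == 0);
	assert(after.error == -ECANCELED);
}
```

Calling model_demo() exercises both paths. In the model, clearing the signaled state first avoids the warning while the error is still recorded as -ECANCELED.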
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0a17fb1..fc1d3a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	trace_amdgpu_sched_run_job(job);
-	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
+	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
+		/* flags might be signaled by unexpected callback, clear it */
+		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
 		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
+	}
 	if (finished->error < 0) {
 		DRM_INFO("Skip scheduling IBs!\n");
--
2.7.4