[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

Grodzovsky, Andrey Andrey.Grodzovsky at amd.com
Fri Dec 21 16:57:09 UTC 2018


I believe this issue would be resolved by my pending  in review patch 
set, specifically 'drm/sched: Refactor ring mirror list handling.' since 
already on the first TO handler it will go over all the rings including 
the second timed out ring and will remove all call backs including the 
bad job cb. In case by this time this bad job will signal for some 
reason it will be removed from the mirror list already during 
drm_sched_process_job (take a look at 'drm/sched: Rework HW fence 
processing.') and hence will not be rerun in drm_sched_job_recovery 
(drm_sched_resubmit_jobs under the new name).

Andrey


On 12/21/2018 03:25 AM, wentalou wrote:
> When 2 rings met timeout at same time, triggered job_timedout separately.
> Each job_timedout called gpu_recover, but one of gpu_recover locked by another's mutex_lock.
> Bad jod’s callback should be removed by dma_fence_remove_callback but locked inside mutex_lock.
> So dma_fence_remove_callback could not be called immediately.
> Then callback drm_sched_process_job triggered unexpectedly, and signaled DMA_FENCE_FLAG_SIGNALED_BIT.
> After another's mutex_unlock, signaled bad job went through job_run inside drm_sched_job_recovery.
> job_run would have WARN_ON and Call-Trace, when calling kcl_dma_fence_set_error for signaled bad job.
>
> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
> Signed-off-by: Wentao Lou <Wentao.Lou at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0a17fb1..fc1d3a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   
>   	trace_amdgpu_sched_run_job(job);
>   
> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
> +		/* flags might be signaled by unexpected callback, clear it */
> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>   		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> +	}
>   
>   	if (finished->error < 0) {
>   		DRM_INFO("Skip scheduling IBs!\n");



More information about the amd-gfx mailing list