[PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
JingWen Chen
jingwech at amd.com
Fri Oct 22 08:14:39 UTC 2021
ping
On 2021/10/22 AM11:33, Jingwen Chen wrote:
> [Why]
> In advance tdr mode, the real bad job will be resubmitted twice, while
> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
> is put one more time than other jobs.
>
> [How]
> Adding dma_fence_get before resbumit job in
> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 41ce86244144..975f069f6fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>
> /* clear job's guilty and depend the folowing step to decide the real one */
> drm_sched_reset_karma(s_job);
> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
> + * to make sure fence is balanced */
> + dma_fence_get(s_job->s_fence->parent);
> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>
> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>
> /* got the hw fence, signal finished fence */
> atomic_dec(ring->sched.score);
> + dma_fence_put(s_job->s_fence->parent);
> dma_fence_get(&s_job->s_fence->finished);
> dma_fence_signal(&s_job->s_fence->finished);
> dma_fence_put(&s_job->s_fence->finished);
More information about the amd-gfx
mailing list