[PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Wang, YuBiao YuBiao.Wang at amd.com
Tue Mar 14 06:33:48 UTC 2023


Hi Luben,

I'd have to ping you because we've got a P1 ticket currently on this issue. Would you please give a vague time when would you confirm whether this patch is safe? Thank you a lot for helping double check this.

Regards & Thanks,
Yubiao 

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov at amd.com> 
Sent: Saturday, March 11, 2023 12:56 AM
To: Wang, YuBiao <YuBiao.Wang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Quan, Evan <Evan.Quan at amd.com>; Chen, Horace <Horace.Chen at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
Subject: Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a 
> failed ib test before mode 1 reset during asic reset. The fences in 
> this case are never signaled and next time when we try to free the 
> sa_bo, kernel will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the 
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be 
> released in job_free_cb, and for non-sched jobs like ib_test, it's 
> meant to be released in sa_bo_free but only when the fences are 
> signaled. So we have to force signal the non_sched bad job's fence 
> during pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which worries me is this check here, !job->base.s_fence, which is used here to qualify that we can signal the fence (and of course that the fence is not yet signalled.) We need to audit this check to make sure that it is not overloaded to mean other things. I'll take a look.

>  		}
>  	}
>  }

--
Regards,
Luben



More information about the amd-gfx mailing list