[PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Luben Tuikov luben.tuikov at amd.com
Thu Mar 16 03:41:09 UTC 2023


On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Hi YuBiao,

Thanks for adding the clarifying comments and sending a v2 of this patch.

Perhaps move the chunk you're adding, to sit before, rather than after,
the statements of the if-conditional. Also move the "job" variable
declaration to be inside the if-conditional, since it is not used
by anything outside it. Something like this, (note a few small fixes to the comment),
	if (old && old->ops == &amdgpu_job_fence_ops) {
		struct amdgpu_job *job;

		/* For non-scheduler bad job, i.e. failed IB test, we need to signal
		 * it right here or we won't be able to track them in fence_drv
		 * and they will remain unsignaled during sa_bo free.
		 */
		job = container_of(old, struct amdgpu_job, hw_fence);
		if (!job->base.s_fence && !dma_fence_is_signaled(old))
			dma_fence_signal(old);
		RCU_INIT_POINTER(*ptr, NULL);
		dma_fence_put(old);
	}
Then, give it a test.
With that change, and upon successful test results, this patch is,
Acked-by: Luben Tuikov <luben.tuikov at amd.com>
-- 
Regards,
Luben



More information about the amd-gfx mailing list