[PATCH 3/3] drm/amdgpu: skip put fence if signal fails

Zhu, Jiadong Jiadong.Zhu at amd.com
Fri Jul 15 09:28:00 UTC 2022


[AMD Official Use Only - General]

Updated some comments

-----Original Message-----
From: Zhu, Jiadong
Sent: Friday, July 15, 2022 5:13 PM
To: Christian König <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
Cc: Huang, Ray <Ray.Huang at amd.com>; Liu, Aaron <Aaron.Liu at amd.com>
Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

Hi Christian,

The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same hw fence because of this commit:

static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
        struct drm_sched_job *s_job;
        struct dma_fence *fence;

        spin_lock(&sched->job_list_lock);
        list_for_each_entry(s_job, &sched->pending_list, list) {
                fence = sched->ops->run_job(s_job);       //fence returned has the same address with swapped fences
                dma_fence_put(fence);
        }
        spin_unlock(&sched->job_list_lock);
}



commit c530b02f39850a639b72d01ebbf7e5d745c60831
Author: Jack Zhang <Jack.Zhang1 at amd.com>
Date:   Wed May 12 15:06:35 2021 +0800

    drm/amd/amdgpu embed hw_fence into amdgpu_job

    Why: Previously hw fence is alloced separately with job.
    It caused historical lifetime issues and corner cases.
    The ideal situation is to take fence to manage both job
    and fence's lifetime, and simplify the design of gpu-scheduler.

    How:
    We propose to embed hw_fence into amdgpu_job.
    1. We cover the normal job submission by this method.
    2. For ib_test, and submit without a parent job keep the
    legacy way to create a hw fence separately.
    v2:
    use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
    embedded in a job.
    v3:
    remove redundant variable ring in amdgpu_job
    v4:
    add tdr sequence support for this feature. Add a job_run_counter to
    indicate whether this job is a resubmit job.
    v5
    add missing handling in amdgpu_fence_enable_signaling

    Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
    Signed-off-by: Jack Zhang <Jack.Zhang7 at hotmail.com>
    Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
    Reviewed by: Monk Liu <monk.liu at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>


Thus the fence we swapped out is signaled and put twice in the following 2 functions and we get " refcount_t: underflow; use-after-free. " errors latter.

                /* wait for jobs finished */
                amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence which is signaled and put somewhere else. The refcount decreased by 1 after amdgpu_fence_wait_empty.

                /* signal the old fences */
                amdgpu_ib_preempt_signal_fences(fences, length);   //signal and put the previous swapped fence, signal would return -22.

Thanks,
Jiadong


-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com>
Sent: Friday, July 15, 2022 4:48 PM
To: Zhu, Jiadong <Jiadong.Zhu at amd.com>; amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
Cc: Huang, Ray <Ray.Huang at amd.com>; Liu, Aaron <Aaron.Liu at amd.com>
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

[CAUTION: External Email]

Am 15.07.22 um 10:43 schrieb jiadong.zhu at amd.com:
> From: "Jiadong.Zhu" <Jiadong.Zhu at amd.com>
>
> Dma_fence_signal returning non-zero indicates that the fence is
> signaled and put somewhere else.
> Skip dma_fence_put to make the fence refcount correct.

Well quite a big NAK on this.

Reference counting should be completely independent where a fence signals.

Andrey can you take a look at this as well?

Thanks,
Christian.

>
> Signed-off-by: Jiadong.Zhu <Jiadong.Zhu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index f4ed0785d523..93c1a5e83835 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct dma_fence **fences,
>               fence = fences[i];
>               if (!fence)
>                       continue;
> -             dma_fence_signal(fence);
> -             dma_fence_put(fence);
> +             if (!dma_fence_signal(fence))
> +                     dma_fence_put(fence);
>       }
>   }
>



More information about the amd-gfx mailing list