[PATCH 3/3] drm/amdgpu: skip put fence if signal fails
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Fri Jul 15 15:42:42 UTC 2022
On 2022-07-15 05:28, Zhu, Jiadong wrote:
> [AMD Official Use Only - General]
>
> Updated some comments
>
> -----Original Message-----
> From: Zhu, Jiadong
> Sent: Friday, July 15, 2022 5:13 PM
> To: Christian König <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Cc: Huang, Ray <Ray.Huang at amd.com>; Liu, Aaron <Aaron.Liu at amd.com>
> Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails
>
> Hi Christian,
>
> The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same hw fence because of this commit:
>
> static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
> struct drm_sched_job *s_job;
> struct dma_fence *fence;
>
> spin_lock(&sched->job_list_lock);
> list_for_each_entry(s_job, &sched->pending_list, list) {
> fence = sched->ops->run_job(s_job); //fence returned has the same address with swapped fences
> dma_fence_put(fence);
> }
> spin_unlock(&sched->job_list_lock);
> }
>
>
>
> commit c530b02f39850a639b72d01ebbf7e5d745c60831
> Author: Jack Zhang <Jack.Zhang1 at amd.com>
> Date: Wed May 12 15:06:35 2021 +0800
>
> drm/amd/amdgpu embed hw_fence into amdgpu_job
>
> Why: Previously hw fence is alloced separately with job.
> It caused historical lifetime issues and corner cases.
> The ideal situation is to take fence to manage both job
> and fence's lifetime, and simplify the design of gpu-scheduler.
>
> How:
> We propose to embed hw_fence into amdgpu_job.
> 1. We cover the normal job submission by this method.
> 2. For ib_test, and submit without a parent job keep the
> legacy way to create a hw fence separately.
> v2:
> use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
> embedded in a job.
> v3:
> remove redundant variable ring in amdgpu_job
> v4:
> add tdr sequence support for this feature. Add a job_run_counter to
> indicate whether this job is a resubmit job.
> v5
> add missing handling in amdgpu_fence_enable_signaling
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
> Signed-off-by: Jack Zhang <Jack.Zhang7 at hotmail.com>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> Reviewed by: Monk Liu <monk.liu at amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>
>
> Thus the fence we swapped out is signaled and put twice in the following 2 functions and we get " refcount_t: underflow; use-after-free. " errors latter.
>
> /* wait for jobs finished */
> amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence which is signaled and put somewhere else. The refcount decreased by 1 after amdgpu_fence_wait_empty.
>
> /* signal the old fences */
> amdgpu_ib_preempt_signal_fences(fences, length); //signal and put the previous swapped fence, signal would return -22.
>
> Thanks,
> Jiadong
Did you have 'drm/amdgpu: Follow up change to previous drm scheduler
change.' this commit in your branch while you encountered this problem ?
I don't see an underflow issue
for the preempted job when inspecting the code with this commit in mind -
amdgpu_fence_emit
dma_fence_init 1
dma_fence_get(fence) 2
rcu_assign_pointer(*ptr, dma_fence_get(fence) 3
drm_sched_main
s_fence->parent = dma_fence_get(fence); 4
dma_fence_put(fence); 3
amdgpu_ib_preempt_job_recovery
amdgpu_fence_emit
if (job && job->job_run_counter) -> dma_fence_get(fence); 4
rcu_assign_pointer(*ptr, dma_fence_get(fence)); 5
dma_fence_put(fence); 4
amdgpu_fence_wait_empty
dma_fence_get_rcu(fence) 5
dma_fence_put(fence) 4
amdgpu_process_fence (EOP interrupt for re-submission of preempted job)
dma_fence_put 3
amdgpu_ib_preempt_signal_fences
dma_fence_put 2
amdgpu_job_free_cb
dma_fence_put(&job->hw_fence) 1
drm_sched_fence_release_scheduled
dma_fence_put(fence->parent); 0
Also take a look here for reference -
https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view
Andrey
Andrey
>
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Friday, July 15, 2022 4:48 PM
> To: Zhu, Jiadong <Jiadong.Zhu at amd.com>; amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Cc: Huang, Ray <Ray.Huang at amd.com>; Liu, Aaron <Aaron.Liu at amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails
>
> [CAUTION: External Email]
>
> Am 15.07.22 um 10:43 schrieb jiadong.zhu at amd.com:
>> From: "Jiadong.Zhu" <Jiadong.Zhu at amd.com>
>>
>> Dma_fence_signal returning non-zero indicates that the fence is
>> signaled and put somewhere else.
>> Skip dma_fence_put to make the fence refcount correct.
> Well quite a big NAK on this.
>
> Reference counting should be completely independent where a fence signals.
>
> Andrey can you take a look at this as well?
>
> Thanks,
> Christian.
>
>> Signed-off-by: Jiadong.Zhu <Jiadong.Zhu at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index f4ed0785d523..93c1a5e83835 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct dma_fence **fences,
>> fence = fences[i];
>> if (!fence)
>> continue;
>> - dma_fence_signal(fence);
>> - dma_fence_put(fence);
>> + if (!dma_fence_signal(fence))
>> + dma_fence_put(fence);
>> }
>> }
>>
More information about the amd-gfx
mailing list