[PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Oct 27 19:43:21 UTC 2021
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
> On 2021/10/25 下午11:18, Andrey Grodzovsky wrote:
>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
>>> yes
>>
>> Then maybe update the description because 'underflow' is very confusing
>>
> will do
>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>> ping
>>>>>
>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>> [Why]
>>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>>>> is put one more time than other jobs.
>>>>>>
>>>>>> [How]
>>>>>> Adding dma_fence_get before resbumit job in
>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>>>
>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>> 1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>> drm_sched_reset_karma(s_job);
>>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>> + * to make sure fence is balanced */
>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>
>>>> Andrey
>>>>
>>>>
>>> Hi Andrey,
>>>
>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.
>>
>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>> What if you make dma_get_fence unconditional instead ?
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put).
I can't see it - can you point me where in that case you get unbalanced
refcount ? As far as I see for a a normal job
being ran in amdgpu_device_recheck_guilty_jobs the refcount on hw_fence
is -
drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decrease by 1
drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get increase by 1
In total refcount didn't change until now
Next, dma_fence_wait_timeout completed successfully because the job is
normal and then you delete that job from pending list and call the
free_job cb which drops remaining refcounts on the hw_fence.
I am probably missing some dma_fence_get since you checked it on a
device but I wonder where is my mistake ?
Andrey
> I have figured out all the get/put
>
> for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine.
>
>>>>>> + dma_fence_get(s_job->s_fence->parent);
>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>> /* got the hw fence, signal finished fence */
>>>>>> atomic_dec(ring->sched.score);
>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>> dma_fence_put(&s_job->s_fence->finished);
More information about the amd-gfx
mailing list