[PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
Christian König
christian.koenig at amd.com
Fri Jun 24 06:00:10 UTC 2022
Am 23.06.22 um 23:18 schrieb Andrey Grodzovsky:
>
> On 2022-06-22 11:04, Christian König wrote:
>> Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-06-22 05:00, Christian König wrote:
>>>> Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
>>>>> On 2022-06-21 03:19, Christian König wrote:
>>>>>
>>>>>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>>>>>> Problem:
>>>>>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>>>>>> but it drops only by 1.
>>>>>>>
>>>>>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>>>>>> dma_fence_get -> refcount 2
>>>>>>> dme_fence_put -> refcount 1
>>>>>>>
>>>>>>> Fix:
>>>>>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>>>>>
>>>>>> Well what is the external_hw_fence good for in this construct?
>>>>>
>>>>>
>>>>> As far as I understand for direct submissions you don't want to
>>>>> pass a job
>>>>> pointer to ib_schedule and so u can't use the embedded fence for
>>>>> this case.
>>>>
>>>> Can you please look a bit deeper into this, we now have a couple of
>>>> fields in the job structure which have no obvious use.
>>>>
>>>> I think we could pass a job structure to ib_schedule even for
>>>> direct submit now.
>>>
>>>
>>> Are you sure ? I see a lot of activities in amdgpu_ib_schedule
>>> depend on presence of vm and fence_ctx which are set if the job
>>> pointer argument != NULL, might this have a negative impact on
>>> direct submit ?
>>
>> Not 100% sure, but we did tons of workarounds because we didn't had a
>> job pointer for direct submit.
>>
>> But this was before we embedded the IBs at the end of the job.
>>
>> It's quite likely that this should be possible now, it's just that
>> somebody needs to double check.
>>
>> Christian.
>
>
> Looking more i see stuff like amdgpu_vm_flush and
> amdgpu_ring_emit_cntxcntl, emit_frame_cntl that are conditioned on job
> argument, doesn't look to me like this is relevant to direct submit ?
I think that those could and maybe even should be cleaned up.
>
> I also noticed that direct submit passes back the created fence to
> it's caller while freeing the job immediately, Using embedded job here
> will increase the time the job object will hang around the memory
> without any use as long as it's fence is referenced. Job object is
> much larger then a single fence.
Ah! Yes, you are right! That was the fundamental problem we ran into:
When we submit the IB tests during GPU reset we can't allocate any memory!
Ok, that needs further investigation. Please go ahead with your plan to
fix this first and then clean it up later on.
Regards,
Christian.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> index 10aa073600d4..58568fdde2d0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct
>>>>>>> drm_sched_job *s_job)
>>>>>>> /* only put the hw fence if has embedded fence */
>>>>>>> if (job->hw_fence.ops != NULL)
>>>>>>> dma_fence_put(&job->hw_fence);
>>>>>>> - else
>>>>>>> + else {
>>>>>>
>>>>>> When one side of the if uses {} the other side should use {} as
>>>>>> well, e.g. use } else { here.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> + dma_fence_put(job->external_hw_fence);
>>>>>>> kfree(job);
>>>>>>> + }
>>>>>>> }
>>>>>>> void amdgpu_job_free(struct amdgpu_job *job)
>>>>>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>>>> /* only put the hw fence if has embedded fence */
>>>>>>> if (job->hw_fence.ops != NULL)
>>>>>>> dma_fence_put(&job->hw_fence);
>>>>>>> - else
>>>>>>> + else {
>>>>>>> + dma_fence_put(job->external_hw_fence);
>>>>>>> kfree(job);
>>>>>>> + }
>>>>>>> }
>>>>>>> int amdgpu_job_submit(struct amdgpu_job *job, struct
>>>>>>> drm_sched_entity *entity,
>>>>>>
>>>>
>>
More information about the amd-gfx
mailing list