[PATCH 1/1] drm/amdkfd: Remove queue sysfs and doorbell after unmapping

Felix Kuehling felix.kuehling at amd.com
Tue Jul 5 19:12:19 UTC 2022


On 2022-07-05 15:02, philip yang wrote:
>
>
> On 2022-07-05 14:48, Felix Kuehling wrote:
>> On 2022-06-10 21:03, Philip Yang wrote:
>>> If destroying/unmapping queue failed, application may destroy queue
>>> again, cause below kernel warning backtrace.
>>>
>>> For outstanding queues, either applications forget to destroy or failed
>>> to destroy, kfd_process_notifier_release will remove queue sysfs
>>> objects, kfd_process_wq_release will free queue doorbell.
>>>
>>>   refcount_t: underflow; use-after-free.
>>>   WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
>>>    Call Trace:
>>>     kobject_put+0xd6/0x1a0
>>>     kfd_procfs_del_queue+0x27/0x30 [amdgpu]
>>>     pqm_destroy_queue+0xeb/0x240 [amdgpu]
>>>     kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
>>>     kfd_ioctl+0x27d/0x500 [amdgpu]
>>>     do_syscall_64+0x35/0x80
>>>
>>>   WARNING: CPU: 2 PID: 3053 at 
>>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
>>>    Call Trace:
>>>     deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
>>>     destroy_queue_cpsch+0xb3/0x270 [amdgpu]
>>>     pqm_destroy_queue+0x108/0x240 [amdgpu]
>>>     kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
>>>     kfd_ioctl+0x27d/0x500 [amdgpu]
>>>
>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 4 ++--
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +-
>>>   2 files changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> index e1797657b04c..1c519514ca1a 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> @@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct 
>>> device_queue_manager *dqm,
>>>       mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
>>>               q->properties.type)];
>>>   -    deallocate_doorbell(qpd, q);
>>> -
>>>       if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
>>>           (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)) {
>>>           deallocate_sdma_queue(dqm, q);
>>> @@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct 
>>> device_queue_manager *dqm,
>>>           }
>>>       }
>>>   +    deallocate_doorbell(qpd, q);
>>> +
>>
>> I'm not sure what this change is meant to do. I don't see an early 
>> return in this function, so deallocate_doorbell will always be 
>> executed either way.
>
> If app destroy queue deallocate_doorbell, but then unmap queue failed, 
> app will destroy queue again when app exit, deallocate_doorbell second 
> time will trigger the WARN backtrace.
>
I get that. But even with your change, deallocate_doorbell will still be 
executed if the queue unmap fails because there is no early return or 
"goto error".


> As queue_count and q->list is used to alloc ring buf in 
> execute_queues_cpsch, so this change causes regression on gfx9, I will 
> submit new patch to handle unmap failed case with MES and fix those 
> issues.
>
I think the intention of the code was to treat HWS in a way that does 
not prevent queue destruction. Basically, there is not point reporting 
HWS errors to the application because the application cannot do anything 
about them anyway. Eventually it will cause a GPU reset and the 
application will be killed. If you look at how -ETIME is handled in 
pqm_destroy_queue, you see that it finishes the job regardless.

However, this has always been somewhat inconsistent. With MES maybe it's 
getting worse because there may be other error conditions we didn't hit 
before. Are you seeing those backtraces on a GPU with MES by any chance?

Regards,
   Felix


> Regards,
>
> Philip
>
>>
>>>       /*
>>>        * Unconditionally decrement this counter, regardless of the 
>>> queue's
>>>        * type
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> index dc00484ff484..99f2a6412201 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> @@ -419,7 +419,6 @@ int pqm_destroy_queue(struct 
>>> process_queue_manager *pqm, unsigned int qid)
>>>       }
>>>         if (pqn->q) {
>>> -        kfd_procfs_del_queue(pqn->q);
>>>           dqm = pqn->q->device->dqm;
>>>           retval = dqm->ops.destroy_queue(dqm, &pdd->qpd, pqn->q);
>>>           if (retval) {
>>> @@ -439,6 +438,7 @@ int pqm_destroy_queue(struct 
>>> process_queue_manager *pqm, unsigned int qid)
>>>           if (dev->shared_resources.enable_mes)
>>>               amdgpu_amdkfd_free_gtt_mem(dev->adev,
>>>                              pqn->q->gang_ctx_bo);
>>> +        kfd_procfs_del_queue(pqn->q);
>>
>> This part makes sense.
>>
>> Regards,
>>   Felix
>>
>>
>>>           uninit_queue(pqn->q);
>>>       }


More information about the amd-gfx mailing list