[PATCH 1/2] drm/amdkfd: Fix some double free when destroy queue fails

Pan, Xinhui Xinhui.Pan at amd.com
Thu Jun 17 00:48:31 UTC 2021



> 2021年6月17日 06:55,Kuehling, Felix <Felix.Kuehling at amd.com> 写道:
> 
> On 2021-06-16 4:35 a.m., xinhui pan wrote:
>> Some resource are freed even destroy queue fails.
> 
> Looks like you're keeping this behaviour for -ETIME. That is consistent with what pqn_destroy_queue does. What you're fixing here is the behaviour for non-timeout errors. Please make that clear in the patch description.
will do that in v2.

> 
> Out of curiosity, what kind of error were you getting? The only other ones that are not a fatal memory shortage, are some EINVAL cases in pm_unmap_queues_v*. But that would indicate some internal error, that a queue was created with an invalid type, or maybe the queue data structure was somehow corrupted.
> 
This is just because amdkfd_fence_wait_timeout got timeout. execute_queues_cpsch return EIO as dqm->is_hws_hang is true.
hit this issue with kfdtest --gtest_filter=*QM*
> 
>>  That will cause double
>> free when user-space issue another destroy_queue ioctl.
>> 
>> Paste some log below.
>> 
>> amdgpu: Can't create new usermode queue because -1 queues were already
>> created
>> 
>> refcount_t: underflow; use-after-free.
>> Call Trace:
>>  kobject_put+0xe6/0x1b0
>>  kfd_procfs_del_queue+0x37/0x50 [amdgpu]
>>  pqm_destroy_queue+0x17a/0x390 [amdgpu]
>>  kfd_ioctl_destroy_queue+0x57/0xc0 [amdgpu]
>>  kfd_ioctl+0x463/0x690 [amdgpu]
>> 
>> BUG kmalloc-32 (Tainted: G        W        ): Object already free
>> INFO: Allocated in allocate_sdma_mqd+0x30/0xb0 [amdgpu] age=4796 cpu=2
>> pid=2511
>>  __slab_alloc+0x72/0x80
>>  kmem_cache_alloc_trace+0x81f/0x8c0
>>  allocate_sdma_mqd+0x30/0xb0 [amdgpu]
>>  create_queue_cpsch+0xbf/0x470 [amdgpu]
>>  pqm_create_queue+0x28d/0x6d0 [amdgpu]
>>  kfd_ioctl_create_queue+0x492/0xae0 [amdgpu]
>> INFO: Freed in free_mqd_hiq_sdma+0x20/0x60 [amdgpu] age=2537 cpu=7
>> pid=2511
>>  kfree+0x322/0x340
>>  free_mqd_hiq_sdma+0x20/0x60 [amdgpu]
>>  destroy_queue_cpsch+0x20c/0x330 [amdgpu]
>>  pqm_destroy_queue+0x1a3/0x390 [amdgpu]
>>  kfd_ioctl_destroy_queue+0x57/0xc0 [amdgpu]
>> 
>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 2 ++
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c               | 4 +++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 1 +
>>  3 files changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> index e6366b408420..c24ab8f17eb6 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> @@ -1529,6 +1529,8 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
>>  				KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
>>  		if (retval == -ETIME)
>>  			qpd->reset_wavefronts = true;
>> +		else if (retval)
>> +			goto failed_try_destroy_debugged_queue;
>>  		if (q->properties.is_gws) {
>>  			dqm->gws_queue_count--;
>>  			qpd->mapped_gws_queue = false;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 09b98a83f670..984197e5929f 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -607,11 +607,13 @@ static int kfd_procfs_add_sysfs_files(struct kfd_process *p)
>>    void kfd_procfs_del_queue(struct queue *q)
>>  {
>> -	if (!q)
>> +	if (!q || !kobject_get_unless_zero(&q->kobj))
>>  		return;
>>    	kobject_del(&q->kobj);
>>  	kobject_put(&q->kobj);
>> +	/* paired with the get above */
>> +	kobject_put(&q->kobj);
>>  }
>>    int kfd_process_create_wq(void)
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> index 95a6c36cea4c..4fcb64bc43dd 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> @@ -373,6 +373,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
>>  		dqm = pqn->kq->dev->dqm;
>>  		dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
>>  		kernel_queue_uninit(pqn->kq, false);
>> +		pqn->kq = NULL;
> 
> This seems unrelated to this patch. But if you're fixing this, I'd expect a similar fix after uninit_queue(pqn->q).
> 
> Regards,
>   Felix
> 
> 
>>  	}
>>    	if (pqn->q) {

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 15002 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20210617/49875db0/attachment-0001.bin>


More information about the amd-gfx mailing list