<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 2022-07-05 14:48, Felix Kuehling
wrote:<br>
</div>
<blockquote type="cite" cite="mid:2f71d9b5-59aa-45cb-0245-7dc139f108b1@amd.com">On
2022-06-10 21:03, Philip Yang wrote:
<br>
<blockquote type="cite">If destroying/unmapping queue failed,
application may destroy queue
<br>
again, cause below kernel warning backtrace.
<br>
<br>
For outstanding queues, either applications forget to destroy or
failed
<br>
to destroy, kfd_process_notifier_release will remove queue sysfs
<br>
objects, kfd_process_wq_release will free queue doorbell.
<br>
<br>
refcount_t: underflow; use-after-free.
<br>
WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
<br>
Call Trace:
<br>
kobject_put+0xd6/0x1a0
<br>
kfd_procfs_del_queue+0x27/0x30 [amdgpu]
<br>
pqm_destroy_queue+0xeb/0x240 [amdgpu]
<br>
kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
<br>
kfd_ioctl+0x27d/0x500 [amdgpu]
<br>
do_syscall_64+0x35/0x80
<br>
<br>
WARNING: CPU: 2 PID: 3053 at
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
<br>
Call Trace:
<br>
deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
<br>
destroy_queue_cpsch+0xb3/0x270 [amdgpu]
<br>
pqm_destroy_queue+0x108/0x240 [amdgpu]
<br>
kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
<br>
kfd_ioctl+0x27d/0x500 [amdgpu]
<br>
<br>
Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
<br>
---
<br>
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 4
++--
<br>
drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +-
<br>
2 files changed, 3 insertions(+), 3 deletions(-)
<br>
<br>
diff --git
a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
<br>
index e1797657b04c..1c519514ca1a 100644
<br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
<br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
<br>
@@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct
device_queue_manager *dqm,
<br>
mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
<br>
q->properties.type)];
<br>
- deallocate_doorbell(qpd, q);
<br>
-
<br>
if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
<br>
(q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)) {
<br>
deallocate_sdma_queue(dqm, q);
<br>
@@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct
device_queue_manager *dqm,
<br>
}
<br>
}
<br>
+ deallocate_doorbell(qpd, q);
<br>
+
<br>
</blockquote>
<br>
I'm not sure what this change is meant to do. I don't see an early
return in this function, so deallocate_doorbell will always be
executed either way.
<br>
</blockquote>
<p>If app destroy queue deallocate_doorbell, but then unmap queue
failed, app will destroy queue again when app exit,
deallocate_doorbell second time will trigger the WARN backtrace.</p>
<p>As queue_count and q->list is used to alloc ring buf in
execute_queues_cpsch, so this change causes regression on gfx9, I
will submit new patch to handle unmap failed case with MES and fix
those issues.</p>
<p>Regards,</p>
<p>Philip<br>
</p>
<blockquote type="cite" cite="mid:2f71d9b5-59aa-45cb-0245-7dc139f108b1@amd.com">
<br>
<blockquote type="cite"> /*
<br>
* Unconditionally decrement this counter, regardless of
the queue's
<br>
* type
<br>
diff --git
a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
<br>
index dc00484ff484..99f2a6412201 100644
<br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
<br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
<br>
@@ -419,7 +419,6 @@ int pqm_destroy_queue(struct
process_queue_manager *pqm, unsigned int qid)
<br>
}
<br>
if (pqn->q) {
<br>
- kfd_procfs_del_queue(pqn->q);
<br>
dqm = pqn->q->device->dqm;
<br>
retval = dqm->ops.destroy_queue(dqm,
&pdd->qpd, pqn->q);
<br>
if (retval) {
<br>
@@ -439,6 +438,7 @@ int pqm_destroy_queue(struct
process_queue_manager *pqm, unsigned int qid)
<br>
if (dev->shared_resources.enable_mes)
<br>
amdgpu_amdkfd_free_gtt_mem(dev->adev,
<br>
pqn->q->gang_ctx_bo);
<br>
+ kfd_procfs_del_queue(pqn->q);
<br>
</blockquote>
<br>
This part makes sense.
<br>
<br>
Regards,
<br>
Felix
<br>
<br>
<br>
<blockquote type="cite"> uninit_queue(pqn->q);
<br>
}
<br>
</blockquote>
</blockquote>
</body>
</html>