[PATCH AUTOSEL 6.1 02/16] amd/amdkfd: fix a kfd_process ref leak
Sasha Levin
sashal at kernel.org
Mon Jun 9 13:47:11 UTC 2025
From: Yifan Zhang <yifan1.zhang at amd.com>
[ Upstream commit 90237b16ec1d7afa16e2173cc9a664377214cdd9 ]
This patch fixes a kfd_process ref leak.
Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>
Reviewed-by: Philip Yang <Philip.Yang at amd.com>
Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
Signed-off-by: Sasha Levin <sashal at kernel.org>
---
**YES**
This commit should be backported to stable kernel trees. Here's my
detailed analysis:
## Bug Analysis
This patch fixes a clear **resource leak bug** where a reference count
is not properly decremented in an error path. Specifically:
1. **The leak occurs in `kfd_signal_poison_consumed_event()`**, in the
error path taken when `kfd_process_get_user_gpu_id()` returns
`-EINVAL`. There the function returns early without calling
`kfd_unref_process(p)`:
```c
	if (unlikely(user_gpu_id == -EINVAL)) {
		WARN_ONCE(1, "Could not get user_gpu_id from dev->id:%x\n", dev->id);
		return; // BUG: Missing kfd_unref_process(p)
	}
```
2. **The function `kfd_lookup_process_by_pasid()` increments the
reference count** on the kfd_process object (as confirmed by
examining the kernel source). This means every successful call must
be paired with `kfd_unref_process()`.
3. **The fix is minimal and correct**: It simply adds the missing
`kfd_unref_process(p)` call before the early return, ensuring the
reference count is properly decremented.
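
The get/put imbalance described above can be reproduced outside the kernel with a small user-space sketch. Everything prefixed `demo_` is hypothetical and only stands in for the real KFD helpers; a plain `int` replaces the kernel's reference counter, and the "lookup" simply takes a reference on the object it is handed. It is meant to illustrate the shape of the bug and of the fix, not the driver's actual implementation:

```c
#include <errno.h>

/* Hypothetical stand-in for struct kfd_process: only the refcount matters here. */
struct demo_process {
	int refcount;
};

/* Like kfd_lookup_process_by_pasid(): a successful lookup takes a reference. */
struct demo_process *demo_lookup(struct demo_process *p)
{
	if (p)
		p->refcount++;
	return p;
}

/* Like kfd_unref_process(): drops one reference. */
void demo_unref(struct demo_process *p)
{
	p->refcount--;
}

/* Buggy shape: the error path returns without dropping the reference. */
void demo_signal_buggy(struct demo_process *proc, int user_gpu_id)
{
	struct demo_process *p = demo_lookup(proc);

	if (!p)
		return;
	if (user_gpu_id == -EINVAL)
		return;		/* reference taken by demo_lookup() is leaked */

	/* ... normal event signalling would happen here ... */
	demo_unref(p);
}

/* Fixed shape: every exit after a successful lookup drops the reference. */
void demo_signal_fixed(struct demo_process *proc, int user_gpu_id)
{
	struct demo_process *p = demo_lookup(proc);

	if (!p)
		return;
	if (user_gpu_id == -EINVAL) {
		demo_unref(p);	/* the one-line fix */
		return;
	}

	/* ... normal event signalling would happen here ... */
	demo_unref(p);
}
```

Starting from a `demo_process` whose `refcount` is 1, calling `demo_signal_buggy()` with `-EINVAL` leaves the count at 2, while `demo_signal_fixed()` leaves it at 1 on both the error path and the normal path.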
## Why This Should Be Backported
1. **It's a clear bug fix**: Every time this error path is hit, one
reference on the `kfd_process` is leaked, so its refcount can never
drop to zero and the structure is never freed. Repeated hits
accumulate leaked memory.
2. **The fix is minimal and contained**: Only one line is added, making
it low-risk for regression. The change doesn't alter any logic or
introduce new features.
3. **It affects user-visible behavior**: When GPU errors occur and this
code path is triggered repeatedly, the memory leak could impact
system stability and performance.
4. **Similar to previously backported commits**: Looking at the
historical examples, commit #4 which also fixed a memory leak in
create_process failure was marked for backporting and included "Cc:
stable at vger.kernel.org".
5. **The bug is in error handling code**: While the error condition
might be rare (invalid GPU ID), when it does occur, the leak happens
every time. Error handling bugs are particularly important to fix
because they can accumulate unnoticed.
6. **No architectural changes**: This is a simple bug fix that doesn't
change any APIs, data structures, or introduce new functionality.
The commit follows the stable kernel rules perfectly: it fixes a real
bug, is minimal in scope, has clear consequences if not fixed (memory
leak), and doesn't introduce new features or risky changes.
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index 2880ed96ac2e3..80d567ba94846 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -1340,6 +1340,7 @@ void kfd_signal_poison_consumed_event(struct kfd_dev *dev, u32 pasid)
 	user_gpu_id = kfd_process_get_user_gpu_id(p, dev->id);
 	if (unlikely(user_gpu_id == -EINVAL)) {
 		WARN_ONCE(1, "Could not get user_gpu_id from dev->id:%x\n", dev->id);
+		kfd_unref_process(p);
 		return;
 	}
--
2.39.5