[PATCH] drm/amdkfd: only flush the validate MES contex
Prike Liang
Prike.Liang at amd.com
Wed Jan 22 09:25:59 UTC 2025
The following page fault was observed duringthe KFD process release.
In this particular error case, the HIP test (./MemcpyPerformance -h)
does not require the queue. As a result, the process_context_addr was
not assigned when the KFD process was released, ultimately leading to
this page fault during the execution of kfd_process_dequeue_from_all_devices().
[345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
[345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
Signed-off-by: Prike Liang <Prike.Liang at amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 9c2d8393cd4c..c39cdff58418 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -86,9 +86,13 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
if (pdd->already_dequeued)
return;
-
+ /* The MES context flush needs to filter out the case which the
+ * KFD process is created without setting up the MES context and
+ * queue for creating a compute queue.
+ */
dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
if (dev->kfd->shared_resources.enable_mes &&
+ !!pdd->proc_ctx_gpu_addr &&
down_read_trylock(&dev->adev->reset_domain->sem)) {
amdgpu_mes_flush_shader_debugger(dev->adev,
pdd->proc_ctx_gpu_addr);
--
2.34.1
More information about the amd-gfx
mailing list