[PATCH] drm/kfd: fix a system crash issue during GPU recovery
Felix Kuehling
felix.kuehling at amd.com
Tue Sep 1 13:26:04 UTC 2020
I'm not sure how the bug you're fixing is caused, but your fix is
clearly in the wrong place.
A queue being disabled is not the same thing as a queue being destroyed.
Queues can be disabled for legitimate reasons, but they still should
exist and be in the qpd->queues_list.
If a destroyed queue is left on the qpd->queues_list, that would be a
problem. Can you point out where such a thing is happening?
Thanks,
Felix
Am 2020-08-31 um 9:36 p.m. schrieb Dennis Li:
> The crash log as the below:
>
> [Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI
> [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G OE 5.4.0-42-generic #46~18.04.1-Ubuntu
> [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
> [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
> [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
> [Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213
> [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038
> [Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0
> [Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004
> [Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28
> [Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000
> [Thu Aug 20 23:18:14 2020] FS: 0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000
> [Thu Aug 20 23:18:14 2020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0
> [Thu Aug 20 23:18:14 2020] Call Trace:
> [Thu Aug 20 23:18:14 2020] kfd_process_evict_queues+0x43/0xd0 [amdgpu]
> [Thu Aug 20 23:18:14 2020] kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
> [Thu Aug 20 23:18:14 2020] kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
> [Thu Aug 20 23:18:14 2020] kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
> [Thu Aug 20 23:18:14 2020] amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
> [Thu Aug 20 23:18:14 2020] amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
> [Thu Aug 20 23:18:14 2020] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
> [Thu Aug 20 23:18:14 2020] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
> [Thu Aug 20 23:18:14 2020] process_one_work+0x20f/0x400
> [Thu Aug 20 23:18:14 2020] worker_thread+0x34/0x410
>
> When GPU hang, user process will fail to create a compute queue whose
> struct object will be freed later, but driver wrongly add this queue to
> queue list of the proccess. And then kfd_process_evict_queues will
> access a freed memory, which cause a system crash.
>
> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 560adc57a050..d5e6b07ffb27 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1296,16 +1296,18 @@ static int create_queue_cpsch(struct device_queue_manager *dqm, struct queue *q,
> mqd_mgr->init_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj,
> &q->gart_mqd_addr, &q->properties);
>
> - list_add(&q->list, &qpd->queues_list);
> - qpd->queue_count++;
> -
> if (q->properties.is_active) {
> increment_queue_count(dqm, q->properties.type);
>
> retval = execute_queues_cpsch(dqm,
> KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
> + if (retval)
> + goto out_execute_cpsch;
> }
>
> + list_add(&q->list, &qpd->queues_list);
> + qpd->queue_count++;
> +
> /*
> * Unconditionally increment this counter, regardless of the queue's
> * type or whether the queue is active.
> @@ -1318,6 +1320,9 @@ static int create_queue_cpsch(struct device_queue_manager *dqm, struct queue *q,
> dqm_unlock(dqm);
> return retval;
>
> +out_execute_cpsch:
> + decrement_queue_count(dqm, q->properties.type);
> + dqm_unlock(dqm);
> out_deallocate_doorbell:
> deallocate_doorbell(qpd, q);
> out_deallocate_sdma_queue:
More information about the amd-gfx
mailing list