[PATCH] drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1
Kim, Jonathan
Jonathan.Kim at amd.com
Thu Jan 16 21:00:07 UTC 2025
> -----Original Message-----
> From: Cornwall, Jay <Jay.Cornwall at amd.com>
> Sent: Thursday, January 16, 2025 3:41 PM
> To: amd-gfx at lists.freedesktop.org
> Cc: Cornwall, Jay <Jay.Cornwall at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>
> Subject: [PATCH] drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1
>
> The purpose of halt_if_hws_hang is to preserve GPU state for driver
> debugging when queue preemption fails. Issuing a per-queue reset may
> kill the wavefronts that caused the preemption failure.
>
> Signed-off-by: Jay Cornwall <jay.cornwall at amd.com>
> Cc: Jonathan Kim <Jonathan.Kim at amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index f157494bfdb1..195085079eb2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -2327,9 +2327,9 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
> */
> mqd_mgr = dqm->mqd_mgrs[KFD_MQD_TYPE_HIQ];
> if (mqd_mgr->check_preemption_failed(mqd_mgr, dqm->packet_mgr.priv_queue->queue->mqd)) {
> + while (halt_if_hws_hang)
> + schedule();
> if (reset_queues_on_hws_hang(dqm)) {
> - while (halt_if_hws_hang)
> - schedule();
> dqm->is_hws_hang = true;
> kfd_hws_hang(dqm);
> retval = -ETIME;
> --
> 2.34.1
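The effect of moving the halt loop can be modeled outside the kernel. This is a minimal userspace sketch, not driver code: `enum hang_action`, `old_path()` and `new_path()` are hypothetical names introduced only for illustration, and the `while (halt_if_hws_hang) schedule();` spin is modeled as returning `ACTION_HALT`, since the real loop never returns while the parameter is set. Following the diff's `if (reset_queues_on_hws_hang(dqm))`, the sketch assumes a nonzero return from the reset means the reset failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical outcomes of the preemption-failure path.
 * Userspace model only, not kfd_device_queue_manager.c.
 */
enum hang_action {
	ACTION_NONE,        /* preemption succeeded, nothing to do */
	ACTION_HALT,        /* driver would spin in schedule() forever */
	ACTION_QUEUE_RESET, /* per-queue reset recovered the hang */
	ACTION_HWS_HANG,    /* reset failed: is_hws_hang, kfd_hws_hang(), -ETIME */
};

/*
 * Before the patch: the per-queue reset runs first, and the halt only
 * happens once the reset has already failed.
 */
static enum hang_action old_path(bool preempt_failed, bool halt_if_hws_hang,
				 bool reset_fails, bool *reset_called)
{
	assert(reset_called != NULL);
	*reset_called = false;
	if (!preempt_failed)
		return ACTION_NONE;
	*reset_called = true;	/* models reset_queues_on_hws_hang(dqm) */
	if (!reset_fails)
		return ACTION_QUEUE_RESET;	/* halt flag never consulted */
	if (halt_if_hws_hang)
		return ACTION_HALT;	/* too late: waves may already be killed */
	return ACTION_HWS_HANG;
}

/*
 * After the patch: halt_if_hws_hang is checked before any reset, so the
 * state that caused the preemption failure is left untouched.
 */
static enum hang_action new_path(bool preempt_failed, bool halt_if_hws_hang,
				 bool reset_fails, bool *reset_called)
{
	assert(reset_called != NULL);
	*reset_called = false;
	if (!preempt_failed)
		return ACTION_NONE;
	if (halt_if_hws_hang)
		return ACTION_HALT;
	*reset_called = true;	/* models reset_queues_on_hws_hang(dqm) */
	return reset_fails ? ACTION_HWS_HANG : ACTION_QUEUE_RESET;
}
```

With halt_if_hws_hang=1 and a reset that would succeed, `old_path()` still calls the reset and returns `ACTION_QUEUE_RESET`, so the requested halt never happens; `new_path()` returns `ACTION_HALT` without calling the reset at all. With halt_if_hws_hang=0 both models behave identically.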