[PATCH] drm/amdkfd: fix missed queue reset on queue destroy

Felix Kuehling felix.kuehling at amd.com
Wed Aug 28 21:37:43 UTC 2024


On 2024-08-22 11:17, Jonathan Kim wrote:
> If a queue is being destroyed but causes a HWS hang on removal, the KFD
> may issue an unnecessary gpu reset if the destroyed queue can be fixed
> by a queue reset.
>
> This is because the queue has been removed from the KFD's queue list
> prior to the preemption action on destroy so the reset call will fail to
> match the HQD PQ reset information against the KFD's queue record to do
> the actual reset.
>
> To fix this, deactivate the queue prior to preemption, since it's being
> destroyed anyway, and remove the queue from the KFD's queue list after
> preemption.
>
> v2: deactivate the queue early and delete it from the list later, as per
> the description, instead of the destroy queue referencing hack.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 577d121cc6d1..6d5a632b95eb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -2407,10 +2407,10 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
>   		pdd->sdma_past_activity_counter += sdma_val;
>   	}
>   
> -	list_del(&q->list);
>   	qpd->queue_count--;

You may need to move the queue_count update as well, so that it stays
consistent with the queue list. Please make sure this passes the KFD queue
tests on GPUs with HWS and MES.

Other than that, this patch is

Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>


>   	if (q->properties.is_active) {
>   		decrement_queue_count(dqm, qpd, q);
> +		q->properties.is_active = false;
>   		if (!dqm->dev->kfd->shared_resources.enable_mes) {
>   			retval = execute_queues_cpsch(dqm,
>   						      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
> @@ -2421,6 +2421,7 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
>   			retval = remove_queue_mes(dqm, q, qpd);
>   		}
>   	}
> +	list_del(&q->list);
>   
>   	/*
>   	 * Unconditionally decrement this counter, regardless of the queue's

