[PATCH] drm/amdkfd: fix missed queue reset on queue destroy
Felix Kuehling
felix.kuehling at amd.com
Wed Aug 28 21:37:43 UTC 2024
On 2024-08-22 11:17, Jonathan Kim wrote:
> If a queue is being destroyed but causes a HWS hang on removal, the KFD
> may issue an unnecessary gpu reset if the destroyed queue can be fixed
> by a queue reset.
>
> This is because the queue has been removed from the KFD's queue list
> prior to the preemption action on destroy so the reset call will fail to
> match the HQD PQ reset information against the KFD's queue record to do
> the actual reset.
>
> To fix this, deactivate the queue prior to preemption since it's being
> destroyed anyways and remove the queue from the KFD's queue list after
> preemption.
>
> v2: early deactivate queue and delete queue from list later as-per
> description instead of destroy queue referencing hack.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 577d121cc6d1..6d5a632b95eb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -2407,10 +2407,10 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
>  		pdd->sdma_past_activity_counter += sdma_val;
>  	}
>
> -	list_del(&q->list);
>  	qpd->queue_count--;

You may need to move the queue_count update below the preemption as
well, so that it stays consistent with the list_del. Please make sure
this passes the KFD queue tests on GPUs with HWS and with MES.

Other than that, this patch is
Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>

>  	if (q->properties.is_active) {
>  		decrement_queue_count(dqm, qpd, q);
> +		q->properties.is_active = false;
>  		if (!dqm->dev->kfd->shared_resources.enable_mes) {
>  			retval = execute_queues_cpsch(dqm,
>  				KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
> @@ -2421,6 +2421,7 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
>  			retval = remove_queue_mes(dqm, q, qpd);
>  		}
>  	}
> +	list_del(&q->list);
>
>  	/*
>  	 * Unconditionally decrement this counter, regardless of the queue's
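
For illustration only, the ordering problem described in the commit message can be sketched as a small user-space toy model. This is not the KFD code: every type and function name below (toy_queue, toy_dqm, toy_recover, toy_destroy) is hypothetical, the "hang" is simulated by calling the recovery path directly, and the lookup is a plain array scan standing in for the match of HQD PQ reset information against the queue record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these names are hypothetical, not the KFD's. */
#define TOY_MAX_QUEUES 8

struct toy_queue {
	uint64_t pq_addr;	/* stands in for the HQD PQ address */
	bool listed;		/* still on the queue list? */
	bool is_active;
};

struct toy_dqm {
	struct toy_queue queues[TOY_MAX_QUEUES];
	size_t count;
	int queue_resets;	/* cheap per-queue resets performed */
	int gpu_resets;		/* expensive full GPU resets performed */
};

static struct toy_queue *toy_add_queue(struct toy_dqm *dqm, uint64_t pq_addr)
{
	assert(dqm->count < TOY_MAX_QUEUES);
	struct toy_queue *q = &dqm->queues[dqm->count++];

	q->pq_addr = pq_addr;
	q->listed = true;
	q->is_active = true;
	return q;
}

/*
 * Recovery path: try to match the hung address against the listed
 * queues. A match allows a per-queue reset; no match falls back to a
 * full GPU reset.
 */
static void toy_recover(struct toy_dqm *dqm, uint64_t hung_pq_addr)
{
	for (size_t i = 0; i < dqm->count; i++) {
		struct toy_queue *q = &dqm->queues[i];

		if (q->listed && q->pq_addr == hung_pq_addr) {
			dqm->queue_resets++;
			return;
		}
	}
	dqm->gpu_resets++;
}

/*
 * Destroy a queue whose preemption always "hangs" in this model.
 * unlink_first == true mimics the old order (list_del before
 * preemption); false mimics the patched order (deactivate, preempt,
 * then list_del).
 */
static void toy_destroy(struct toy_dqm *dqm, struct toy_queue *q,
			bool unlink_first)
{
	if (unlink_first)
		q->listed = false;

	if (q->is_active) {
		q->is_active = false;		/* early deactivate */
		toy_recover(dqm, q->pq_addr);	/* simulated hang on unmap */
	}

	q->listed = false;			/* late list removal */
}
```

In this model, destroying a queue with unlink_first set ends in gpu_resets == 1, because the recovery lookup no longer finds the queue. With unlink_first clear, the same destroy ends in queue_resets == 1. A counter kept next to the list, like queue_count in the real code, would presumably need to move together with the list removal to stay consistent.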