[PATCH v6 07/11] drm/amdgpu: validate userq's last fence prior to destroying

Tue Jul 15 12:15:10 UTC 2025

On 15.07.25 13:50, Liang, Prike wrote:
> [Public]
> 
> Regards,
>       Prike
> 
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Friday, July 11, 2025 8:13 PM
>> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>> Subject: Re: [PATCH v6 07/11] drm/amdgpu: validate userq's last fence prior to
>> destroying
>>
>> On 11.07.25 11:39, Prike Liang wrote:
>>> The userq requires validating queue status before destroying it, if
>>> user tries to destroy a busy userq by IOCTL then the driver should
>>> report an error for this illegal usage.
>>
>> Clear NAK, destroying a busy userqueue is perfectly valid!
> Yes, the firmware should handle such case something like as preempting the queue.
> If we directly unmap a hang queue and may further cause the MES firmware hang up,
> so, do we need to detect the hang userq here by checking the userq fence status and reset the hang queue before further performs the unmap queue?

No, waiting for the last fence should be perfectly sufficient since the hang detection is separate from this.

BTW please remove the 100ms timeout here, we should wait forever or until the hang detection has suspended the queue and signaled the fence with an error.

Regards,
Christian.

> 
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 15 ++++++++++++---
>>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index 81fbb00b6d91..bcbe8d3f66ed 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -281,7 +281,7 @@ amdgpu_userq_map_helper(struct amdgpu_userq_mgr
>> *uq_mgr,
>>>     return r;
>>>  }
>>>
>>> -static void
>>> +static int
>>>  amdgpu_userq_wait_for_last_fence(struct amdgpu_userq_mgr *uq_mgr,
>>>                              struct amdgpu_usermode_queue *queue)  { @@ -
>> 290,10 +290,14 @@
>>> amdgpu_userq_wait_for_last_fence(struct amdgpu_userq_mgr *uq_mgr,
>>>
>>>     if (f && !dma_fence_is_signaled(f)) {
>>>             ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100));
>>> -           if (ret <= 0)
>>> +           if (ret <= 0) {
>>>                     drm_file_err(uq_mgr->file, "Timed out waiting for
>> fence=%llu:%llu\n",
>>>                                  f->context, f->seqno);
>>> +                   return -ETIMEDOUT;
>>> +           }
>>>     }
>>> +
>>> +   return 0;
>>>  }
>>>
>>>  static void
>>> @@ -509,7 +513,12 @@ amdgpu_userq_destroy(struct drm_file *filp, int
>> queue_id)
>>>             mutex_unlock(&uq_mgr->userq_mutex);
>>>             return -EINVAL;
>>>     }
>>> -   amdgpu_userq_wait_for_last_fence(uq_mgr, queue);
>>> +
>>> +   if (amdgpu_userq_wait_for_last_fence(uq_mgr, queue)) {
>>> +           drm_warn(adev_to_drm(uq_mgr->adev), "Don't destroy a busy
>> userq\n");
>>> +           /* For the fence signal timeout case, it requires resetting the busy
>> queue.*/
>>> +           r = -ETIMEDOUT;
>>> +   }
>>>
>>>     /*
>>>      * At this point the userq obj va should be mapped,
>