[PATCH v3 3/4] drm/amdgpu: use drm_file_err in logging to also dump process information

Pierre-Eric Pelloux-Prayer pierre-eric at damsy.net
Wed Apr 16 12:07:24 UTC 2025


Hi,

Le 16/04/2025 à 12:01, Khatri, Sunil a écrit :
> 
> On 4/16/2025 12:56 PM, Tvrtko Ursulin wrote:
>>
>> On 15/04/2025 19:43, Sunil Khatri wrote:
>>> add process and pid information in the userqueue error
>>> logging to make it more useful in resolving the error
>>> by logs.
>>>
>>> Sample log:
>>> [   42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for 
>>> fence f=000000001c74d978 for comm:Xwayland pid:3427
>>> [   42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout 
>>> waiting for comm:Xwayland pid:3427
>>> [   42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for 
>>> fence f=0000000074407d3e for comm:systemd-logind pid:1058
>>> [   42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout 
>>> waiting for comm:systemd-logind pid:1058
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------
>>>   1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/ 
>>> amdgpu_userqueue.c
>>> index 1867520ba258..05c1ee27a319 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>>> @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr *uq_mgr,
>>>       if (f && !dma_fence_is_signaled(f)) {
>>>           ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100));
>>>           if (ret <= 0) {
>>> -            DRM_ERROR("Timed out waiting for fence f=%p\n", f);
>>> +            drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f);
>>
>> You decided to leave %p after all?
> 
> Yes we are printing the fence ptr here to see which fence is timing out. Anyways right now intention 
> of this patch is to add additional process information along with existing information like fence here.
> 

I agree with Tvrtko, "fence=%llu:%llu" would be better to identify "which fence is timing out".


Pierre-Eric


> regards
> Sunil
> 
>>
>>>               return;
>>>           }
>>>       }
>>> @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct amdgpu_userq_mgr *uq_mgr)
>>>       }
>>>         if (ret)
>>> -        DRM_ERROR("Failed to map all the queues\n");
>>> +        drm_file_err(uq_mgr->file, "Failed to map all the queue\n");
>>
>> You lost the plural by accident.
> Yes i will add 's'. Noted.
>>
> I am also not sure "all the queues" makes sense in this context versus "all queues" but it's 
> inconsequential really.
> Regards
> Sunil
>> Yes it all queues from a uq_mgr.
>>> +
>>>       return ret;
>>>   }
>>>   @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct amdgpu_userq_mgr *uq_mgr)
>>>       }
>>>         if (ret)
>>> -        DRM_ERROR("Couldn't unmap all the queues\n");
>>> +        drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n");
>>> +
>>>       return ret;
>>>   }
>>>   @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct amdgpu_userq_mgr *uq_mgr)
>>>               continue;
>>>           ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100));
>>>           if (ret <= 0) {
>>> -            DRM_ERROR("Timed out waiting for fence f=%p\n", f);
>>> +            drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f);
>>>               return -ETIMEDOUT;
>>>           }
>>>       }
>>> @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct amdgpu_userq_mgr *uq_mgr,
>>>       /* Wait for any pending userqueue fence work to finish */
>>>       ret = amdgpu_userqueue_wait_for_signal(uq_mgr);
>>>       if (ret) {
>>> -        DRM_ERROR("Not suspending userqueue, timeout waiting for work\n");
>>> +        drm_file_err(uq_mgr->file, "Not suspending userqueue, timeout waiting\n");
>>>           return;
>>>       }
>>>         ret = amdgpu_userqueue_suspend_all(uq_mgr);
>>>       if (ret) {
>>> -        DRM_ERROR("Failed to evict userqueue\n");
>>> +        drm_file_err(uq_mgr->file, "Failed to evict userqueue\n");
>>>           return;
>>
>> It is pre-existing but strikes me as odd that failure to amdgpu_userqueue_suspend_all() logs a 
>> failure to *evict* instead of suspend (as the previous log does). Anyway, I did not look at the 
>> surrounding code so just thinking out loud.
> 
> Yes suspend failed as all the fences were not evicted and thats why suspend failed. Anyways there 
> are already alex patches which will change this to unmap as a code reorganisation for suspend/resume 
> is in pipeline.
> 
> regards
> 
> Sunil
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>       }
>>


More information about the dri-devel mailing list