[PATCH 1/1] drm/amdgpu: Use device wedged event

Mon Dec 16 13:44:37 UTC 2024

On 12/16/2024 7:09 PM, Christian König wrote:
> Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>>> to inform userspace that the driver has done a reset and recovered, it
>>>>>> would need some additional context as well. We have a mechanism in KFD
>>>>>> which sends the context in which a reset has to be done. Currently,
>>>>>> that's restricted to compute applications, but if this is along similar
>>>>>> lines, we would like to pass some additional info like job timeout, RAS
>>>>>> error, etc.
>>>>>>
>>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done
>>>>> a reset and recovered, but additional data, such as which job timed
>>>>> out, the RAS error and so on, belongs in devcoredump I guess, where all
>>>>> data is gathered and collected later.
>>>> I think somebody else also mentioned that the source of the issue,
>>>> e.g. the PID of the submitting process, would be helpful for
>>>> supervising daemons which need to restart processes when they
>>>> caused some issue.
>>>>
>>> It was me :) We have a use case where the daemon would indeed need the
>>> PID, but the daemon doesn't need to know what the RAS error is or the
>>> name of the job that timed out. There's no immediate action to be taken
>>> with that information, unlike the PID, which we do need to know.
>>>
>> Regarding devcoredump - it's not done every time. For example, RAS errors
>> have a different way to identify the source of the error, hence we don't
>> need a coredump in such cases.
>>
>> The intention is only to let the user know the reason for reset at a
>> high level, and probably add more things later like the engines or
>> queues that have reset etc.
> 
> Well, what is the use case for that? That doesn't look valuable to me.

It's mostly for in-band telemetry reporting through tools like amd-smi,
more for admin purposes than for debugging.

Thanks,
Lijo

> 
> RAS errors should generally be reported to the application that issued
> the submission.
> 
> As a system wide event they are only useful in things like logfiles I think.
> 
> Regards,
> Christian.
> 
>> Thanks,
>> Lijo
>>
>>>> We just postponed adding that till later.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>