[PATCH 1/1] drm/amdgpu: Use device wedged event
Christian König
christian.koenig at amd.com
Mon Dec 16 13:39:43 UTC 2024
On 16.12.24 at 14:36, Lazar, Lijo wrote:
>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>> to inform userspace that driver has done a reset and recovered, it would
>>>>> need some additional context also. We have a mechanism in KFD which
>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>> restricted to compute applications, but if this is in a similar line, we
>>>>> would like to pass some additional info like job timeout, RAS error etc.
>>>>>
>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a
>>>> reset and recovered, but additional data, like which job timed out,
>>>> RAS errors and such, belongs in devcoredump I guess, where all the
>>>> data is gathered and collected later.
>>> I think somebody else mentioned it as well that the source of the
>>> issue, e.g. the PID of the submitting process would be helpful as well
>>> for supervising daemons which need to restart processes when they
>>> caused some issue.
>>>
>> It was me :) We have a use case where the daemon would indeed need the
>> PID, but the daemon doesn't need to know what the RAS error is or the
>> name of the job that timed out. There's no immediate action to be taken
>> with this information, contrary to the PID that we need to know.
>>
> Regarding devcoredump - it's not done every time. For example, RAS errors
> have a different way to identify the source of the error, hence we don't
> need a coredump in such cases.
>
> The intention is only to let the user know the reason for reset at a
> high level, and probably add more things later like the engines or
> queues that have reset etc.
Well, what is the use case for that? That doesn't look valuable to me.
RAS errors should generally be reported to the application that issued
the submission.
As a system-wide event they are only useful in things like log files, I think.
Regards,
Christian.
>
> Thanks,
> Lijo
>
>>> We just postponed adding that till later.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Regards,
>>>>>> Christian.
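
For readers following the thread, here is a minimal sketch of how a supervising daemon might pick the recovery method out of a wedged uevent. It assumes the event reaches userspace as a kernel uevent carrying a `WEDGED=` key with a comma-separated list of recovery methods (for example `none`), as described in the DRM uAPI documentation. The function names and the sample payload are hypothetical, and no PID or reset-reason field is assumed, since the thread says that was postponed.

```python
from typing import Dict, List, Optional


def parse_uevent(payload: bytes) -> Dict[str, str]:
    """Split a NUL-separated kernel uevent payload into KEY=VALUE pairs."""
    env: Dict[str, str] = {}
    for field in payload.split(b"\0"):
        key, sep, value = field.decode("utf-8", "replace").partition("=")
        if sep:  # skips the "action@devpath" header and empty fields
            env[key] = value
    return env


def wedged_methods(env: Dict[str, str]) -> Optional[List[str]]:
    """Return the recovery methods from WEDGED=, or None if the key is absent."""
    value = env.get("WEDGED")
    if value is None:
        return None
    return [method for method in value.split(",") if method]


# Hypothetical payload, shaped like a netlink kobject uevent message.
SAMPLE = (
    b"change@/devices/pci0000:00/0000:00:01.0/drm/card0\0"
    b"ACTION=change\0SUBSYSTEM=drm\0WEDGED=none\0"
)

if __name__ == "__main__":
    print(wedged_methods(parse_uevent(SAMPLE)))
```

A daemon built on this would only learn which recovery method applies; identifying the offending process would still need the additional context discussed above.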