[PATCH 1/1] drm/amdgpu: Use device wedged event

Mon Dec 16 13:10:46 UTC 2024

Am 16.12.24 um 14:04 schrieb André Almeida:
> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>
>>
>> On 12/16/2024 3:48 PM, Christian König wrote:
>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>> capture.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealmeid at igalia.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>> %d\n", r);
>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> +
>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>
>>>>>>> That looks really good in general. I would just make the
>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>
>>>>>>
>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>> success
>>>>>> of it, shouldn't we tell userspace?
>>>>>
>>>>> A failed reset would perhaps result in wedging, atleast that's how 
>>>>> i915
>>>>> is handling it.
>>>>>
>>>>
>>>> Right, and I think this raises the question of what wedge recovery
>>>> method should I add for amdgpu... Christian?
>>>>
>>>
>>> In theory a rebind should be enough to get the device going again, our
>>> BOCO does a bus reset on driver load anyway.
>>>
>>
>> The behavior varies between SOCs. In certain ones, if driver reset
>> fails, that means it's really in a bad state and it would need system
>> reboot.
>>
>
> Is this documented somewhere? Then I could even add a 
> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

Not publicly as far as I know. But indeed a driver reset has basically 
the same chance of succeeding than a driver reload.

I think the use case we have here is more that the administrator 
intentionally disabled the reset to allow HW investigation.

So far we did that with a rather broken we don't do anything at all 
approach.

>> I had asked earlier about the utility of this one here. If this is just
>> to inform userspace that driver has done a reset and recovered, it would
>> need some additional context also. We have a mechanism in KFD which
>> sends the context in which a reset has to be done. Currently, that's
>> restricted to compute applications, but if this is in a similar line, we
>> would like to pass some additional info like job timeout, RAS error etc.
>>
>
> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
> reset and recovered, but additional data about like which job timeout, 
> RAS error and such belong to devcoredump I guess, where all data is 
> gathered and collected later.

I think somebody else mentioned it as well that the source of the issue, 
e.g. the PID of the submitting process would be helpful as well for 
supervising daemons which need to restart processes when they caused 
some issue.

We just postponed adding that till later.

Regards,
Christian.

>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>
>