[PATCH 1/1] drm/amdgpu: Use device wedged event

Mon Dec 16 13:04:28 UTC 2024

Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> 
> 
> On 12/16/2024 3:48 PM, Christian König wrote:
>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>> Hi Christian,
>>>>>
>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>> capture.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid at igalia.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>> +
>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>
>>>>>> That looks really good in general. I would just make the
>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>
>>>>>
>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>> success
>>>>> of it, shouldn't we tell userspace?
>>>>
>>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>>> is handling it.
>>>>
>>>
>>> Right, and I think this raises the question of what wedge recovery
>>> method should I add for amdgpu... Christian?
>>>
>>
>> In theory a rebind should be enough to get the device going again, our
>> BOCO does a bus reset on driver load anyway.
>>
> 
> The behavior varies between SOCs. In certain ones, if driver reset
> fails, that means it's really in a bad state and it would need system
> reboot.
> 

Is this documented somewhere? Then I could even add a 
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

> I had asked earlier about the utility of this one here. If this is just
> to inform userspace that driver has done a reset and recovered, it would
> need some additional context also. We have a mechanism in KFD which
> sends the context in which a reset has to be done. Currently, that's
> restricted to compute applications, but if this is in a similar line, we
> would like to pass some additional info like job timeout, RAS error etc.
> 

DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
reset and recovered, but additional data about like which job timeout, 
RAS error and such belong to devcoredump I guess, where all data is 
gathered and collected later.

> Thanks,
> Lijo
> 
>> Regards,
>> Christian.
>