[PATCH v7 1/5] drm: Introduce device wedged event

Fri Oct 18 17:56:17 UTC 2024

On 18/10/2024 12:31, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi at intel.com> wrote:
>>
>> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
>>> Hi Raag,
>>>
>>> On 30/09/2024 04:38, Raag Jadav wrote:
>>>> Introduce device wedged event, which will notify userspace of the wedged
>>>> (hung/unusable) state of the DRM device through a uevent. This is
>>>> useful especially in cases where the device is no longer operating as
>>>> expected even after a hardware reset and has become unrecoverable from
>>>> driver context.
>>>>
>>>> The purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, hence the vendor-agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> The current implementation defines three recovery methods, of which
>>>> drivers can choose to support one or more. The preferred recovery
>>>> method is sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take the respective action to recover the device.
>>>>
>>>>       =============== ==================================
>>>>       Recovery method Consumer expectations
>>>>       =============== ==================================
>>>>       rebind          unbind + rebind driver
>>>>       bus-reset       unbind + reset bus device + rebind
>>>>       reboot          reboot system
>>>>       =============== ==================================
>>>>
>>>>
>>>
>>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
>>>
>>> The motivation was that amdgpu was getting stuck after every GPU reset, and
>>> there was just a black screen. The uevent would then trigger a daemon to
>>> reset the compositor and get things back together. As you can see in my
>>> thread, the feature was blocked in favor of improving overall GPU reset
>>> on the kernel side.
>>>
>>> Which scenarios make i915/xe need userspace involvement? I tested a bunch
>>> of resets in i915 but never managed to get the driver stuck.
>>
>> 2 scenarios:
>>
>> 1. Multiple levels of reset have failed and the device was declared wedged.
>> This is rare indeed, as the resets have improved a lot.
>> 2. Debug case. We can boot the driver with an option to declare the device
>> wedged on any timeout, so the device can be debugged.
>>
>>>
>>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
>>> intervention.
>>
>> How do you trigger that?
> 
> What do you mean by bus reset?  I think Christian is just referring
> to a full adapter reset (as opposed to a queue reset or something more
> fine-grained).  The driver can reset the device via MMIO or firmware,
> depending on the device.  I think there are also PCI helpers for
> things like PCI FLR.
> 

I was referring to AMD_RESET_PCI:

"Does a full bus reset using core Linux subsystem PCI reset and does a 
secondary bus reset or FLR, depending on what the underlying hardware 
supports."

And that can be triggered by setting `amdgpu_reset_method=5` as a module
option.
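
For context on the userspace side of the quoted patch, here is a rough sketch of how a consumer might turn the WEDGED=<method> value from the uevent environment into the recovery steps listed in the table. This is only an illustration: the function names, the step strings and the sample payload are made up for this example and are not part of the patch. It assumes the raw kernel uevent layout of a header followed by NUL-separated KEY=VALUE entries.

```python
# Hypothetical consumer-side sketch; none of these names come from the patch.

# Recovery method -> consumer expectations, copied from the table in the
# quoted commit message.
RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent_env(payload: bytes) -> dict:
    """Split a raw uevent payload into a dict of KEY=VALUE entries.

    Entries without '=' (such as the leading "action@devpath" header)
    and empty trailing fields are skipped.
    """
    env = {}
    for field in payload.split(b"\0"):
        key, sep, value = field.decode("utf-8", errors="replace").partition("=")
        if sep:
            env[key] = value
    return env


def recovery_steps(env: dict) -> list:
    """Return the expected recovery steps for a wedged event.

    Returns an empty list if the event carries no WEDGED key, and raises
    ValueError for a method this sketch does not know about.
    """
    method = env.get("WEDGED")
    if method is None:
        return []
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")
    return list(RECOVERY_STEPS[method])


if __name__ == "__main__":
    # Made-up sample payload, for illustration only.
    sample = b"change@/devices/example\0ACTION=change\0SUBSYSTEM=drm\0WEDGED=bus-reset\0"
    for step in recovery_steps(parse_uevent_env(sample)):
        print(step)
```

A real setup would more likely match on the WEDGED key in a udev rule, as the commit message suggests, and run the actual unbind/reset/rebind from there. The sketch only shows the mapping from method to expected action.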