[PATCH v7 1/5] drm: Introduce device wedged event
Rodrigo Vivi
rodrigo.vivi at intel.com
Fri Oct 18 14:56:38 UTC 2024
On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
>
> On 30/09/2024 04:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hung/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> >
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> >
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> >
> >     =============== ==================================
> >     Recovery method Consumer expectations
> >     =============== ==================================
> >     rebind          unbind + rebind driver
> >     bus-reset       unbind + reset bus device + rebind
> >     reboot          reboot system
> >     =============== ==================================
> >
> >
>
> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
>
> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
>
> Which kinds of scenarios make i915/xe need userspace involvement? I tested
> a bunch of resets in i915 but never managed to get the driver stuck.
Two scenarios:
1. Multiple levels of reset have failed and the device was declared wedged.
   This is rare indeed, as the resets have improved a lot.
2. Debug case. We can boot the driver with an option to declare the device
   wedged at any timeout, so the device can be debugged.
>
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention.
How do you trigger that?
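
For reference, the consumer side of the WEDGED=<method> uevent from the quoted
patch description could look roughly like the sketch below. It only parses a
uevent payload and maps the method to the steps listed in the table; it does
not touch sysfs or reboot anything. The names parse_uevent, plan_recovery and
RECOVERY_STEPS are hypothetical, and the method strings are taken from this v7
description, so they are an assumption for any other revision.

```python
# Hypothetical sketch of a userspace consumer; nothing here is defined by
# the patch itself except the WEDGED key and its three method values.

# Recovery methods and consumer expectations, copied from the table in the
# quoted v7 patch description.
RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent(payload: bytes) -> dict:
    """Split a NUL-separated uevent payload into a KEY=VALUE dict."""
    env = {}
    for field in payload.split(b"\0"):
        key, sep, value = field.decode("utf-8", "replace").partition("=")
        if sep:  # skips the leading "action@devpath" header and empty fields
            env[key] = value
    return env


def plan_recovery(payload: bytes) -> list:
    """Return the ordered recovery steps for a wedged event, or [] if the
    uevent carries no WEDGED key."""
    method = parse_uevent(payload).get("WEDGED")
    if method is None:
        return []
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")
    return list(RECOVERY_STEPS[method])
```

A real consumer would more likely be a udev rule matching on the WEDGED
property and calling a helper script; the mapping above is only meant to show
what that helper is expected to do for each method.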