[PATCH v7 1/5] drm: Introduce device wedged event

Fri Oct 18 12:46:19 UTC 2024

On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote:
> Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> > > > > Purpose of this implementation is to provide drivers a generic way to
> > > > > recover with the help of userspace intervention. Different drivers may
> > > > > have different ideas of a "wedged device" depending on their hardware
> > > > > implementation, and hence the vendor agnostic nature of the event.
> > > > > It is up to the drivers to decide when they see the need for recovery
> > > > > and how they want to recover from the available methods.
> > > > > 
> > > > > Current implementation defines three recovery methods, out of which,
> > > > > drivers can choose to support any one or multiple of them. Preferred
> > > > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > > > and take respective action to recover the device.
> > > > > 
> > > > >       =============== ==================================
> > > > >       Recovery method Consumer expectations
> > > > >       =============== ==================================
> > > > >       rebind          unbind + rebind driver
> > > > >       bus-reset       unbind + reset bus device + rebind
> > > > >       reboot          reboot system
> > > > >       =============== ==================================
> > > Well that sounds like userspace would need to be involved in recovery.
> > > 
> > > That in turn is a complete no-go since we at least need to signal all
> > > dma_fences to unblock the kernel. In other words things like bus reset needs
> > > to happen inside the kernel and *not* in userspace.
> > > 
> > > What we can do is to signal to userspace: Hey a bus reset of device X
> > > happened, maybe restart container, daemon, whatever service which was using
> > > this device.
> > Well, when we declare device 'wedged' it is because we don't want to take
> > any drastic measures inside the kernel and want to leave it in a protected
> > and unusable state. In a way that users wouldn't lose display for instance,
> > or at least the device is in a debugable state.
> 
> Uff, that needs to be very very well documented or otherwise the whole
> approach is an absolutely clear NAK from my side as DMA-buf maintainer.
> 
> > 
> > Then, the instructions here is to tell what could possibly be attempted
> > from userspace to get the device to an usable state.
> > 
> > The 'wedge' mode (the one emiting this uevent) needs to be responsible
> > for signaling all the fences and everything needed for a clean unbind
> > and whatever next step might be indicated to userspace.
> > 
> > That should already be part of any wedged mode, regardless the uevent
> > to inform the userspace here.
> 
> You need to approach that from a different side. With the current patch set
> you are ignoring documented mandatory driver behavior as far as I can see.
> 
> So first of all describe in the documentation what the wedged mode is and
> what requirements a driver has to fulfill to enter it:
> https://docs.kernel.org/gpu/drm-uapi.html#device-reset
>
> Especially document that all system memory accesses of the device needs to
> be blocked by (for example) disabling DMA accesses in the PCI config space.
> 
> When it is guaranteed that the device can't access any system memory any
> more the device driver should signal all pending fences of this device.
> 
> And only after all of that is done the driver  can send an uevent to inform
> userspace that it can debug the hanged state.

Sure, will do.

> As far as I can see this makes the enum how to recover the device
> superfluous because you will most likely always need a bus reset to get out
> of this again.

That depends on the kind of fault the device has encountered and the bus it is
sitting on. There could be buses that don't support reset.

Raag