[PATCH v7 1/5] drm: Introduce device wedged event

Fri Oct 11 08:47:32 UTC 2024

On Thu, Oct 10, 2024 at 08:02:10AM -0500, Lucas De Marchi wrote:
> On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
> > On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> > > On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > > > On 30.09.2024 09:38, Raag Jadav wrote:
> > > > >
> > > > > +/**
> > > > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > > > + * Drivers can choose to support any one or multiple of them depending on
> > > > > + * their needs.
> > > > > + */
> > > > > +enum drm_wedge_recovery {
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > > > +	DRM_WEDGE_RECOVERY_MAX
> > > > > +};
> > > > > +
> > > > >  /**
> > > > >   * struct drm_device - DRM device structure
> > > > >   *
> > > > > @@ -317,6 +337,9 @@ struct drm_device {
> > > > >  	 * Root directory for debugfs files.
> > > > >  	 */
> > > > >  	struct dentry *debugfs_root;
> > > > > +
> > > > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > > > +	unsigned long wedge_recovery;
> > > >
> > > > hmm, so before the driver can ask for a reboot as a recovery method from
> > > > wedge it has to somehow add 'reboot' as available method? why it that?
> > > 
> > > It's for consumers to use as fallbacks in case the preferred recovery method
> > > (sent along with uevent) don't workout. (patch 2/5)
> > 
> > On second thought...
> > 
> > Lucas, do we have a convincing enough usecase for fallback recovery?
> > If <method> were to fail, I would expect there to be even bigger problems
> > like kernel crash or unrecoverable hardware failure.
> > 
> > At that point is it worth retrying?
> 
> when we were talking about this, I brought it up about allowing the
> driver to inform what was the supported wedge recovery mechanisms
> when the notification is sent. Not to be intended as fallback mechanism.
> 
> So if the driver sends a notification with:
> 
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT
> 
> it means any of these would be suitable, with the first being the option
> with less side-effect. I don't think we are advising userspace to use
> fallback, just informing what the driver/device supports. Depending on
> the error, the driver may leave only
> 
> 	DRM_WEDGE_RECOVERY_REBOOT
> 
> That name could actually be DRM_WEDGE_RECOVERY_NONE. Because at that
> state the driver doesn't really know what can be done to recover.
> With that we can drop _MAX and use _NONE for bounding check. I think
> we can also omit it in the notification as it's clear:
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET
> 
> This means the driver can use any of these options to recover
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_BUS_RESET
> 
> only bus reset would fix it
> 
> 	WEDGED
> 	
> driver doesn't know anything that could fix it. It may be a soft-reboot,
> hard-reboot, firmware flashing etc... We just don't know.

With this I think we can drop sysfs.
(Already too many ABIs to deal with)

Raag