[PATCH v7 1/5] drm: Introduce device wedged event

Thu Oct 10 13:02:10 UTC 2024

On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
>On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
>> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
>> > On 30.09.2024 09:38, Raag Jadav wrote:
>> > >
>> > > +/**
>> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
>> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
>> > > + * Drivers can choose to support any one or multiple of them depending on
>> > > + * their needs.
>> > > + */
>> > > +enum drm_wedge_recovery {
>> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
>> > > +	DRM_WEDGE_RECOVERY_REBIND,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
>> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
>> > > +	DRM_WEDGE_RECOVERY_REBOOT,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
>> > > +	DRM_WEDGE_RECOVERY_MAX
>> > > +};
>> > > +
>> > >  /**
>> > >   * struct drm_device - DRM device structure
>> > >   *
>> > > @@ -317,6 +337,9 @@ struct drm_device {
>> > >  	 * Root directory for debugfs files.
>> > >  	 */
>> > >  	struct dentry *debugfs_root;
>> > > +
>> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
>> > > +	unsigned long wedge_recovery;
>> >
>> > hmm, so before the driver can ask for a reboot as a recovery method from
>> > wedge it has to somehow add 'reboot' as available method? why it that?
>>
>> It's for consumers to use as fallbacks in case the preferred recovery method
>> (sent along with uevent) don't workout. (patch 2/5)
>
>On second thought...
>
>Lucas, do we have a convincing enough usecase for fallback recovery?
>If <method> were to fail, I would expect there to be even bigger problems
>like kernel crash or unrecoverable hardware failure.
>
>At that point is it worth retrying?

when we were talking about this, I brought it up about allowing the
driver to inform what was the supported wedge recovery mechanisms
when the notification is sent. Not to be intended as fallback mechanism.

So if the driver sends a notification with:

	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT

it means any of these would be suitable, with the first being the option
with less side-effect. I don't think we are advising userspace to use
fallback, just informing what the driver/device supports. Depending on
the error, the driver may leave only

	DRM_WEDGE_RECOVERY_REBOOT

That name could actually be DRM_WEDGE_RECOVERY_NONE. Because at that
state the driver doesn't really know what can be done to recover.
With that we can drop _MAX and use _NONE for bounding check. I think
we can also omit it in the notification as it's clear:

	WEDGED
	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET

This means the driver can use any of these options to recover

	WEDGED
	DRM_WEDGE_RECOVERY_BUS_RESET

only bus reset would fix it

	WEDGED

driver doesn't know anything that could fix it. It may be a soft-reboot,
hard-reboot, firmware flashing etc... We just don't know.

Lucas De Marchi