[PATCH 2/4] drm/xe: Add a helper function to set recovery method

Riana Tauro riana.tauro at intel.com
Thu Jun 19 07:26:28 UTC 2025


Hi Raag,

Thank you for the review comments.

On 6/6/2025 8:42 PM, Raag Jadav wrote:
> On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
>> Add a helper function to set recovery method. The recovery
>> method has to be set before declaring the device wedged and sending the
>> drm wedged uevent. If no method is set, default unbind/re-bind method
>> will be set
>>
>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>>   drivers/gpu/drm/xe/xe_device.h       |  1 +
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>>   3 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 660b0c5126dc..3fd604ebdc6e 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>   	xe_pm_runtime_put(xe);
>>   }
>>   
>> +/**
>> + * xe_device_set_wedged_method - Set wedged recovery method
>> + * @xe: xe device instance
> 
> Missing @method

Missed this. Will fix it.

>
>> + *
>> + * Set wedged recovery method to be sent using drm wedged uevent.
>> + */
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
>> +{
>> +	xe->wedged.method = method;
>> +}
>> +
>>   /**
>>    * xe_device_declare_wedged - Declare device wedged
>>    * @xe: xe device instance
>>    *
>> - * This is a final state that can only be cleared with a module
>> - * re-probe (unbind + bind).
>> - * In this state every IOCTL will be blocked so the GT cannot be used.
>> + * This is a final state that can only be cleared with the method specified
>> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
>> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
>> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.
> 
> The file convention seems like 80 characters for kernel doc, so let's
> stick to it.

Okay, will wrap the kernel-doc at 80 characters.

> 
>>    * In general it will be called upon any critical error such as gt reset
>> - * failure or guc loading failure. Userspace will be notified of this state
>> - * through device wedged uevent.
>> + * failure or guc loading failure or firmware failure.
>> + * Userspace will be notified of this state through device wedged uevent.
>>    * If xe.wedged module parameter is set to 2, this function will be called
>>    * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>    * snapshot capture. In this mode, GT reset won't be attempted so the state of
>> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   		return;
>>   	}
>>   
>> +	/* If no wedge recovery method is set, use default */
>> +	if (!xe->wedged.method)
>> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
>> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);
> 
> Although there are no strict rules about this, we usually don't begin a
> new line with a symbol.

Will fix this.
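
For illustration only, here is a minimal userspace sketch of the default
fallback with the operator kept at the end of the line. Everything prefixed
with mock_/MOCK_ is a hypothetical stand-in, including the flag values; it is
not the driver code and not necessarily what the next revision will look like.

```c
#include <assert.h>

/*
 * Stand-in flag values for this sketch only; the real DRM_WEDGE_RECOVERY_*
 * definitions live in the DRM headers and may differ.
 */
#define MOCK_RECOVERY_REBIND    (1UL << 1)
#define MOCK_RECOVERY_BUS_RESET (1UL << 2)

/* Hypothetical, trimmed-down stand-in for the wedged state in xe_device. */
struct mock_wedged {
	unsigned long method;
};

static void mock_set_wedged_method(struct mock_wedged *w, unsigned long method)
{
	w->method = method;
}

/* Apply the default only when no caller has set a method beforehand. */
static unsigned long mock_resolve_method(struct mock_wedged *w)
{
	if (!w->method)
		mock_set_wedged_method(w, MOCK_RECOVERY_REBIND |
					  MOCK_RECOVERY_BUS_RESET);

	/* After the fallback, some recovery method is always present. */
	assert(w->method != 0);
	return w->method;
}
```

A method set earlier by a caller is left untouched; only a zero value is
replaced by the rebind/bus-reset default.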

> 
>> +
>>   	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>   		xe->needs_flr_on_fini = true;
>>   		drm_err(&xe->drm,
>> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   			dev_name(xe->drm.dev));
>>   
>>   		/* Notify userspace of wedged device */
>> -		drm_dev_wedged_event(&xe->drm,
>> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
>> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);
> 
> I was a bit late to realize it when I originally added this. The event
> call should be after xe_gt_declare_wedged() to comply with wedging rules.
> We notify userspace *after* we're done with driver cleanup.

Will move xe_gt_declare_wedged() before the uevent.
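
As a rough sketch of the intended ordering (per-GT cleanup first, userspace
notification last), assuming a single-threaded userspace mock: all mock_/MOCK_
names are hypothetical, a plain bool replaces the atomic flag, and the early
return on repeated calls is a simplification of the real function.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_STEPS 8

/* The two kinds of steps this sketch records, in the order they happen. */
enum mock_step { MOCK_STEP_GT_CLEANUP, MOCK_STEP_UEVENT };

/* Hypothetical stand-in for the device; not the real struct xe_device. */
struct mock_device {
	bool wedged;
	int num_gt;
	enum mock_step log[MOCK_MAX_STEPS];
	int nsteps;
};

static void mock_record(struct mock_device *d, enum mock_step s)
{
	assert(d->nsteps < MOCK_MAX_STEPS);
	d->log[d->nsteps++] = s;
}

static void mock_declare_wedged(struct mock_device *d)
{
	int i;

	/* Only the first declaration does any work in this sketch. */
	if (d->wedged)
		return;
	d->wedged = true;

	/* Driver-side cleanup: one step per GT, standing in for the GT loop. */
	for (i = 0; i < d->num_gt; i++)
		mock_record(d, MOCK_STEP_GT_CLEANUP);

	/* Notify userspace only after the cleanup above has finished. */
	mock_record(d, MOCK_STEP_UEVENT);
}
```

With two GTs, the recorded sequence is two cleanup steps followed by a single
uevent step, and a second call records nothing further.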

Thanks
Riana

> 
> Raag
> 
>>   	}
>>   
>>   	for_each_gt(gt, xe, id)
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index 0bc3bc8e6803..06350740aac5 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>   }
>>   
>>   void xe_device_declare_wedged(struct xe_device *xe);
>> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>>   
>>   struct xe_file *xe_file_get(struct xe_file *xef);
>>   void xe_file_put(struct xe_file *xef);
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index b93c04466637..fb3617956d63 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -559,6 +559,8 @@ struct xe_device {
>>   		atomic_t flag;
>>   		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>>   		int mode;
>> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
>> +		unsigned long method;
>>   	} wedged;
>>   
>>   	/** @bo_device: Struct to control async free of BOs */
>> -- 
>> 2.47.1
>>