[PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime survivability mode
Riana Tauro
riana.tauro at intel.com
Thu Jul 10 05:59:44 UTC 2025
Hi Umesh
On 7/10/2025 5:14 AM, Umesh Nerlige Ramappa wrote:
> On Wed, Jul 09, 2025 at 04:50:17PM +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be in an unusable
>> state requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates firmware flash is necessary by
>> wedging the device and exposing survivability mode sysfs.
>>
>> The sysfs file below indicates that the device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/survivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_survivability_mode.c | 42 ++++++++++++++++++-
>> drivers/gpu/drm/xe/xe_survivability_mode.h | 1 +
>> .../gpu/drm/xe/xe_survivability_mode_types.h | 1 +
>> 3 files changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/
>> drm/xe/xe_survivability_mode.c
>> index fefb027b1c84..ca1cfa13525a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -137,7 +137,8 @@ static ssize_t survivability_mode_show(struct
>> device *dev,
>> struct xe_survivability_info *info = survivability->info;
>> int index = 0, count = 0;
>>
>> - count += sysfs_emit_at(buff, count, "Survivability mode type: Boot\n");
>> + count += sysfs_emit_at(buff, count, "Survivability mode type: %s\n",
>> + survivability->type ? "Runtime" : "Boot");
>>
>> if (!check_boot_failure(xe))
>> return count;
>> @@ -288,6 +289,45 @@ bool xe_survivability_mode_is_requested(struct
>> xe_device *xe)
>> return check_boot_failure(xe);
>> }
>>
>> +/**
>> + * xe_survivability_mode_runtime_enable - Initialize and enable
>> runtime survivability mode
>> + * @xe: xe device instance
>> + *
>> + * Initialize survivability information and enable runtime
>> survivability mode.
>> + * Runtime survivability mode is enabled when certain errors cause
>> the device to be
>> + * in non-recoverable state. The device is declared wedged with the
>> appropriate
>> + * recovery method and survivability mode sysfs exposed to userspace
>> + *
>> + * Return: 0 if runtime survivability mode is enabled or not
>> requested, negative error
>
> is the "not requested" still applicable here?
That was copied from the boot survivability kernel-doc. It is not
applicable here, so I will remove it.
>
>
>> + * code otherwise.
>> + */
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe)
>> +{
>> + struct xe_survivability *survivability = &xe->survivability;
>> + struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> + int ret;
>> +
>> + if (!IS_DGFX(xe) || IS_SRIOV_VF(xe) || xe->info.platform < XE_BATTLEMAGE) {
>
> Do you think this condition can be better handled with a
> has_runtime_survivability for platforms that support it?
The check is used only once, so I added it inline. It can be split out
into a separate helper function.
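
A rough sketch of what such a helper could look like, written as a
standalone mock rather than driver code. The struct, its fields, the
placeholder platform values and the name has_runtime_survivability() are
all assumptions for illustration; only the condition itself comes from
the patch.

```c
#include <stdbool.h>

/* Stand-in for the driver's platform enum; only the ordering matters here. */
enum xe_platform_sketch {
	XE_SKETCH_OLDER,	/* hypothetical platform older than Battlemage */
	XE_SKETCH_BATTLEMAGE,
	XE_SKETCH_NEWER,	/* hypothetical platform newer than Battlemage */
};

/* Stand-in for the bits of struct xe_device that the check reads. */
struct xe_device_sketch {
	bool is_dgfx;		/* mirrors IS_DGFX(xe) */
	bool is_sriov_vf;	/* mirrors IS_SRIOV_VF(xe) */
	enum xe_platform_sketch platform;
};

/*
 * Hypothetical helper: the logical negation of the "not supported"
 * condition in the patch, given a name so callers read more clearly.
 */
static bool has_runtime_survivability(const struct xe_device_sketch *xe)
{
	return xe->is_dgfx && !xe->is_sriov_vf &&
	       xe->platform >= XE_SKETCH_BATTLEMAGE;
}
```

With something like this, the enable path would only need a single
if (!has_runtime_survivability(xe)) check before returning -EINVAL.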
>
>> + dev_err(&pdev->dev, "Runtime Survivability Mode not supported\n");
>> + return -EINVAL;
>> + }
>> +
>> + ret = init_survivability_mode(xe);
>> + if (ret)
>> + return ret;
>> +
>> + ret = create_survivability_sysfs(pdev);
>> + if (ret)
>> + dev_err(&pdev->dev, "Failed to create survivability mode sysfs\n");
>
> You do not return ret in the above if condition. Is that intentional?
Yes, this is intentional. After such errors the device is not usable,
so it has to be wedged even if the sysfs file cannot be created.
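
To make the control flow explicit, here is a minimal standalone mock of
that behaviour. The struct and the *_sketch functions are hypothetical
stand-ins and not the driver's API; the point is only that a sysfs
failure is logged and the wedge still happens.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the device state touched by the enable path. */
struct dev_sketch {
	bool sysfs_can_succeed;	/* test knob: simulate sysfs creation failing */
	bool sysfs_created;
	bool wedged;
};

/* Stand-in for create_survivability_sysfs(); fails when the knob says so. */
static int create_sysfs_sketch(struct dev_sketch *d)
{
	if (!d->sysfs_can_succeed)
		return -ENOMEM;
	d->sysfs_created = true;
	return 0;
}

/* Mirrors the tail of the runtime enable path: log the error, wedge anyway. */
static int runtime_enable_sketch(struct dev_sketch *d)
{
	int ret = create_sysfs_sketch(d);

	if (ret)
		fprintf(stderr, "Failed to create survivability mode sysfs\n");

	/* Deliberately no early return: the device is unusable either way. */
	d->wedged = true;
	return 0;
}
```

In this mock the function returns 0 and marks the device wedged whether
or not the sysfs step succeeded.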
Thanks
Riana
>
> Regards,
> Umesh
>
>> +
>> + survivability->type = XE_SURVIVABILITY_TYPE_RUNTIME;
>> + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
>> + xe_device_declare_wedged(xe);
>> +
>> + dev_err(&pdev->dev, "Runtime Survivability mode enabled\n");
>> + return 0;
>> +}
>> +
>> /**
>> * xe_survivability_mode_boot_enable - Initialize and enable boot
>> survivability mode
>> * @xe: xe device instance
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/
>> drm/xe/xe_survivability_mode.h
>> index f6ee283ea5e8..1cc94226aa82 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -11,6 +11,7 @@
>> struct xe_device;
>>
>> int xe_survivability_mode_boot_enable(struct xe_device *xe);
>> +int xe_survivability_mode_runtime_enable(struct xe_device *xe);
>> bool xe_survivability_mode_is_boot_enabled(struct xe_device *xe);
>> bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/
>> drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 5dce393498da..cd65a5d167c9 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -11,6 +11,7 @@
>>
>> enum xe_survivability_type {
>> XE_SURVIVABILITY_TYPE_BOOT,
>> + XE_SURVIVABILITY_TYPE_RUNTIME,
>> };
>>
>> struct xe_survivability_info {
>> --
>> 2.47.1
>>