[PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent

Christian König christian.koenig at amd.com
Tue Jun 24 14:23:31 UTC 2025


On 24.06.25 16:03, Riana Tauro wrote:
> Hi Christian
> 
> On 6/24/2025 5:56 PM, Christian König wrote:
>> On 23.06.25 12:01, Riana Tauro wrote:
>>> A device is declared wedged when it is non-recoverable from
>>> the driver context.
>>
>> Well, not quite.
> 
> i took this from the below document. Should it be changed?

The wedge event basically meant that something unexpected happened during the lifetime of the the device (crash, hang whatever).

It can be that the device recovered on it's own and nothing needs to be done (the none case in the documentation) and the event is just send for telemetry collection.

But the usual case is to trigger a bus reset, rebing or even reboot.

> https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging
> 
>>
>>> Some firmware errors can also cause
>>> the device to enter this state and the only method to recover
>>> from this would be to do a firmware flash
>>
>> What? What exactly do you mean with firmware flash here?
>>
>> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?
> 
> It means updating the firmware.
> 
> Series:  https://patchwork.freedesktop.org/series/149756/
> 
> In this xe kmd series, there are few firmware errors that cause the card to be non-functional. The device is declared wedged and a firmware-flash action is sent.

Ok, so let me recap that just to make sure that I did understood that correctly.

You find that the firmware flashed into the device is buggy and then raise a wedge event to automatically trigger a firmware update?

Why not fail to load the driver in the first place? Or at least print a big warning into the system log?

I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.

I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?

Thanks,
Christian.

> 
> There is corresponding fwupd PR in work that uses this uevent to trigger a firmware flash
> 
> fwupd PR: https://github.com/fwupd/fwupd/pull/8944/
> 
> Thanks
> Riana
> 
>>
>> Where is the rest of the series?
>>
>> Regards,
>> Christian.
>>
>>> v2: modify documentation (Raag, Rodrigo)
>>>
>>> Cc: André Almeida <andrealmeid at igalia.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>>> ---
>>>   Documentation/gpu/drm-uapi.rst | 6 +++---
>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>   include/drm/drm_device.h       | 1 +
>>>   3 files changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>> index 263e5a97c080..cd2481458755 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>   more side-effects. If driver is unsure about recovery or method is unknown
>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>> -will be sent instead.
>>> +(like soft/hard system reboot, physical device replacement or any other procedure
>>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>>>     Userspace consumers can parse this event and attempt recovery as per the
>>>   following expectations.
>>> @@ -435,6 +434,7 @@ following expectations.
>>>       none            optional telemetry collection
>>>       rebind          unbind + bind driver
>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>> +    firmware-flash  firmware flash
>>>       unknown         consumer policy
>>>       =============== ========================================
>>>   diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 02556363e918..5f3bbe01c207 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>           return "rebind";
>>>       case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>           return "bus-reset";
>>> +    case DRM_WEDGE_RECOVERY_FW_FLASH:
>>> +        return "firmware-flash";
>>>       default:
>>>           return NULL;
>>>       }
>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>> index 08b3b2467c4c..9d57c8882d93 100644
>>> --- a/include/drm/drm_device.h
>>> +++ b/include/drm/drm_device.h
>>> @@ -30,6 +30,7 @@ struct pci_controller;
>>>   #define DRM_WEDGE_RECOVERY_NONE        BIT(0)    /* optional telemetry collection */
>>>   #define DRM_WEDGE_RECOVERY_REBIND    BIT(1)    /* unbind + bind driver */
>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET    BIT(2)    /* unbind + reset bus device + bind */
>>> +#define DRM_WEDGE_RECOVERY_FW_FLASH    BIT(3)  /* firmware flash */
>>>     /**
>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>
> 
> 



More information about the Intel-xe mailing list