[PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
Riana Tauro
riana.tauro at intel.com
Tue Jul 1 11:37:46 UTC 2025
Hi Rodrigo/Christian
On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>> Or at least print a big warning into the system log?
>>>>>
>>>>> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
>>>>>
>>>>> I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?
>>>>
>>>> The goal is not to do that automatically, but to raise the uevent to the admin
>>>> with enough information that they can decide on the right corrective
>>>> action.
>>>
>>> Christian, Andre, any concerns with this still?
>>
>> Well, that doesn't sound like quite the right use case for wedge events.
>>
>> See, the wedge event is made for automation.
>
> I respectfully disagree with this statement.
>
> The wedged state in i915 and xe, then ported to drm, was never just about
> automation. Of course, the unbind + flr + rebind is one that the driver cannot
> do by itself, hence needs automation. But wedge cases were also very useful
> in other situations like keeping the device in the failure state for debugging
> (without automation) or keeping other critical things up like display with SW
> rendering (again, nothing about automation).
>
>> For example, to allow a process supervising containers to get the device working again and restart the container which used it, or gather a crash log, etc.
>>
>> When you want to notify the system administrator that manual intervention is necessary, then I would just write that into the system log and raise a device event with WEDGED=unknown.
>>
>> What we could potentially do is to separate between WEDGED=unknown and WEDGED=manual, e.g. between driver has no idea what to do and driver printed useful info into the system log.
>
> Well, you are right here. Even our official documentation in drm-uapi.rst
> already says that firmware flashing should be a case for 'unknown'.
I had added a specific method since we know a firmware flash will recover
from the error. Sure, I will change it.
In the current code, there is no recovery method named "unknown", even
though the documentation mentions it:
https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
Since we are adding something new, can it be "manual" instead of "unknown"?
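To make the question concrete, here is a rough userspace sketch (not the
kernel code) of how a method bitmask could be turned into the WEDGED= string,
with "unknown" acting only as a fallback when no named method is given. The
enum values, the function names and the "manual" entry are all hypothetical
and exist only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flags for this sketch; not the kernel's definitions. */
enum wedge_method {
	WEDGE_NONE      = 1u << 0,
	WEDGE_REBIND    = 1u << 1,
	WEDGE_BUS_RESET = 1u << 2,
	WEDGE_MANUAL    = 1u << 3, /* proposed in this thread, not upstream */
};

/* Map a single method bit to its name, or NULL if it has none. */
static const char *wedge_name(unsigned int bit)
{
	switch (bit) {
	case WEDGE_NONE:      return "none";
	case WEDGE_REBIND:    return "rebind";
	case WEDGE_BUS_RESET: return "bus-reset";
	case WEDGE_MANUAL:    return "manual";
	default:              return NULL;
	}
}

/*
 * Build "WEDGED=<a>,<b>" into buf (size must be > 0). If no method is set,
 * or an unnamed bit is found, fall back to "WEDGED=unknown".
 */
static void wedge_event_string(unsigned int methods, char *buf, size_t size)
{
	int named = 0;
	unsigned int bit;

	snprintf(buf, size, "WEDGED=");
	for (bit = 1; bit != 0; bit <<= 1) {
		const char *name;
		size_t len;

		if (!(methods & bit))
			continue;
		name = wedge_name(bit);
		if (!name) {
			named = 0;
			break;
		}
		len = strlen(buf);
		snprintf(buf + len, size - len, "%s%s", named ? "," : "", name);
		named = 1;
	}
	if (!named)
		snprintf(buf, size, "WEDGED=unknown");
}
```

In this sketch "unknown" is never a method a caller can pass in, only what
comes out when nothing else applies, whereas "manual" would be an explicit
choice made by the driver.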
Thanks
Riana
> Let's go with that then. And use other hints like logs and sysfs, so the admin
> has better information on what to do.
>
>>
>> But creating an event with WEDGED=firmware-flash just sounds too specific; when we go down that route we might soon have WEDGED=change-bios-setting, WEDGED=....
>
> Well, I agree that we shouldn't explode the options exponentially here.
>
> Although I believe that firmware flashing should be common in many
> cases and could be a candidate for another indication.
>
> But let's move on with WEDGED='unknown' for this case.
>
> Thanks,
> Rodrigo.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks,
>>>> Rodrigo.