[PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
Raag Jadav
raag.jadav at intel.com
Tue Jul 1 16:02:38 UTC 2025
On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
>On 01.07.25 16:23, Raag Jadav wrote:
>> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>>
>>>>>>>>> I mean a firmware update is usually something which
>>>>>>>>> the system administrator triggers very explicitly
>>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>>> sometimes brick the HW.
>>>>>>>>>
>>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>>> meaning of the wedge event?
>>>>>>>>
>>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>>> uevent to the admin
>>>>>>>> with enough information that they can decide for the right correctable
>>>>>>>> action.
>>>>>>>
>>>>>>> Christian, Andre, any concerns with this still?
>>>>>>
>>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>>
>>>>>> See the wedge event is made for automation.
>>>>>
>>>>> I respectfully disagree with this statement.
>>>>>
>>>>> The wedged state in i915 and xe, then ported to drm, was never just about
>>>>> automation. Of course, the unbind + flr + rebind is one that the driver
>>>>> cannot do by itself, hence needs automation. But wedge cases were also
>>>>> very useful in other situations, like keeping the device in the failure
>>>>> stage for debugging (without automation) or keeping other critical things
>>>>> up, like display with SW rendering (again, nothing about automation).
>
> Granted, automation is probably not the right term.
>
> What I wanted to say is that the wedge event should not replace information in the system log.
>
>>>>>
>>>>>> For example to allow a process supervising containers get the
>>>>>> device working again and re-start the container which used it or
>>>>>> gather crash log etc .....
>>>>>>
>>>>>> When you want to notify the system administrator which manual
>>>>>> intervention is necessary then I would just write that into the
>>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>>
>>>>>> What we could potentially do is to separate between
>>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>>> idea what to do and driver printed useful info into the system
>>>>>> log.
>>>>>
>>>>> Well, you are right here. Even our official documentation in drm-uapi.rst
>>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>>
>>>> I had added specific method since we know firmware flash will recover
>>>> the error. Sure will change it.
>>>>
>>>> In the current code, there is no recovery method named "unknown" even
>>>> though the document mentions it
>>>>
>>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
>>>>
>>>> Since we are adding something new, can it be "manual" instead of unknown?
>>>
>>> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
>>>>
>>>>> Let's go with that then. And use other hints like logs and sysfs so the
>>>>> admin has better information on what to do.
>>>>>
>>>>>> But creating an event with WEDGED=firmware-flash just sounds too
>>>>>> specific, when we go down that route we might soon have
>>>>>> WEDGE=change-bios-setting, WEDGE=....
>>>>>
>>>>> Well, I agree that we shouldn't explode the options exponentially here.
>>>>>
>>>>> Although I believe that firmware flashing should be common in many
>>>>> cases and could be a candidate for another indication.
>>>>>
>>>>> But let's move on with WEDGE='unknown' for this case.
>>
>> I understand that WEDGED=firmware-flash can't be handled in a generic way
>> for all drivers, but it is simply not the same as WEDGED=unknown since the
>> driver knows something specific needs to be done here.
>>
>> I'm wondering if we could add a WEDGED=vendor-specific method for such
>> cases?
>
> Works for me as well.
>
> My main concern was that we should not start to invent specific wedge events for all kinds of different problems.
>
> On the other hand, what's the additional value of distinguishing between unknown and vendor-specific?
>
> In other words even if the necessary handling is unknown to the wedge event, the administrator could and should still examine the logs to see what to do.

They're somewhat similar, except that with vendor-specific the consumer can
execute vendor-specific triggers (look at some sys/proc entries or logs) based
on a device ID it is already familiar with, as defined by the vendor, and this
could potentially be automated.

Unknown is basically "I'm clueless and good luck with your investigation".

So the distinction is whether the driver is able to provide a definition for
its vendor-specific cases, and how well documented they are.

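
As a rough illustration of that distinction, a userspace consumer might
dispatch on the WEDGED value along the lines below. This is only a sketch:
"vendor-specific" is the method proposed in this thread, and parse_wedged(),
pick_action(), the handler table, the returned action strings and the
8086:0000 device ID are all hypothetical names made up for the example.

```python
from typing import Callable, Dict, List


def parse_wedged(uevent_env: Dict[str, str]) -> List[str]:
    """Split the comma-separated WEDGED uevent value into individual methods."""
    value = uevent_env.get("WEDGED", "")
    return [m.strip() for m in value.split(",") if m.strip()]


def pick_action(
    methods: List[str],
    device_id: str,
    vendor_handlers: Dict[str, Callable[[], str]],
) -> str:
    """Decide what the consumer should do for one wedged event.

    "vendor-specific" is the value proposed in this thread (an assumption,
    not an established method); the action strings are hypothetical.
    """
    if "vendor-specific" in methods:
        # The consumer can only act if it already knows this device ID
        # and what the vendor documents for it.
        handler = vendor_handlers.get(device_id)
        if handler is not None:
            return handler()
        return "check-logs"
    if not methods or "unknown" in methods:
        # The driver has no idea what to do: leave it to the administrator.
        return "check-logs"
    # Anything else is treated as a generic recovery method.
    return "generic:" + methods[0]


# Hypothetical usage: a device ID the consumer knows how to handle.
handlers = {"8086:0000": lambda: "follow-vendor-docs"}
env = {"WEDGED": "vendor-specific"}
print(pick_action(parse_wedged(env), "8086:0000", handlers))
```

With WEDGED=unknown, or with a device ID the consumer has no handler for,
the same code falls back to "check-logs", i.e. manual investigation.
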
Raag