[PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent

Tue Jul 1 17:15:59 UTC 2025

On 01/07/2025 13:44, Riana Tauro wrote:
> 
> 
> On 7/1/2025 9:32 PM, Raag Jadav wrote:
>> On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
>>> On 01.07.25 16:23, Raag Jadav wrote:
>>>> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>>>>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>>>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>>>>
>>>>>>>>>>> I mean a firmware update is usually something which
>>>>>>>>>>> the system administrator triggers very explicitly
>>>>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>>>>> sometimes brick the HW.
>>>>>>>>>>>
>>>>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>>>>> meaning of the wedge event?
>>>>>>>>>>
>>>>>>>>>> The goal is not to do that automatically, but to raise the uevent
>>>>>>>>>> to the admin with enough information that they can decide on the
>>>>>>>>>> right corrective action.
>>>>>>>>>
>>>>>>>>> Christian, Andre, any concerns with this still?
>>>>>>>>
>>>>>>>> Well, that doesn't sound like quite the correct use case for wedge
>>>>>>>> events.
>>>>>>>>
>>>>>>>> See, the wedge event is made for automation.
>>>>>>>
>>>>>>> I respectfully disagree with this statement.
>>>>>>>
>>>>>>> The wedged state in i915 and xe, then ported to drm, was never just
>>>>>>> about automation. Of course, the unbind + FLR + rebind is one that the
>>>>>>> driver cannot do by itself, hence needs automation. But wedge cases
>>>>>>> were also very useful in other situations like keeping the device in
>>>>>>> the failure state for debugging (without automation) or keeping other
>>>>>>> critical things up like display with SW rendering (again, nothing
>>>>>>> about automation).
>>>
>>> Granted, automation is probably not the right term.
>>>
>>> What I wanted to say is that the wedge event should not replace
>>> information in the system log.
>>>
>>>>>>>
>>>>>>>> For example, to allow a process supervising containers to get the
>>>>>>>> device working again and restart the container which used it, or
>>>>>>>> gather crash logs, etc.
>>>>>>>>
>>>>>>>> When you want to notify the system administrator which manual
>>>>>>>> intervention is necessary then I would just write that into the
>>>>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>>>>
>>>>>>>> What we could potentially do is to separate between
>>>>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>>>>> idea what to do and driver printed useful info into the system
>>>>>>>> log.
>>>>>>>
>>>>>>> Well, you are right here. Even our official documentation in
>>>>>>> drm-uapi.rst already says that firmware flashing should be a case for
>>>>>>> 'unknown'.
>>>>>>
>>>>>> I had added a specific method since we know a firmware flash will
>>>>>> recover the error. Sure, will change it.
>>>>>>
>>>>>> In the current code, there is no recovery method named "unknown" even
>>>>>> though the document mentions it
>>>>>>
>>>>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
>>>>>>
>>>>>> Since we are adding something new, can it be "manual" instead of
>>>>>> "unknown"?
>>>>>
>>>>> Okay, missed it. It's in the drm_dev_wedged_event function. Will use
>>>>> "unknown".
>>>>>>
>>>>>>> Let's go with that then. And use other hints like logs and sysfs so
>>>>>>> the admin has better information about what to do.
>>>>>>>
>>>>>>>> But creating an event with WEDGED=firmware-flash just sounds too
>>>>>>>> specific; when we go down that route we might soon have
>>>>>>>> WEDGED=change-bios-setting, WEDGED=....
>>>>>>>
>>>>>>> Well, I agree that we shouldn't explode the options exponentially
>>>>>>> here.
>>>>>>>
>>>>>>> Although I believe that firmware flashing should be a common case in
>>>>>>> many situations and could be a candidate for another indication.
>>>>>>>
>>>>>>> But let's move on with WEDGED='unknown' for this case.
>>>>
>>>> I understand that WEDGED=firmware-flash can't be handled in a generic
>>>> way for all drivers, but it is simply not the same as WEDGED=unknown
>>>> since the driver knows something specific needs to be done here.
>>>>
>>>> I'm wondering if we could add a WEDGED=vendor-specific method for such
>>>> cases?
>>>
>>> Works for me as well.
>>>
>>> My main concern was that we should not start to invent specific wedge
>>> events for all kinds of different problems.
>>>
>>> On the other hand, what's the additional value of distinguishing between
>>> unknown and vendor-specific?
>>>
>>> In other words, even if the necessary handling is unknown to the wedge
>>> event, the administrator could and should still examine the logs to
>>> see what to do.
>>
>> They're somewhat similar except the consumer can execute vendor specific
>> triggers (look at some sys/proc entries or logs) based on device id that
>> the consumer is already familiar with as defined by the vendor, and could
>> potentially be automated.
>>
>> Unknown is basically "I'm clueless and good luck with your investigation".
>>
>> So the distinction is whether the driver is able to provide a definition
>> for its vendor specific cases and how well documented they are.
> 
> Agree with Raag. We know which recovery method works here. Rather than
> using 'unknown', a 'manual/vendor' macro seems better, with vendor specific
> documentation for recovery.
> 

That makes sense to me as well, thanks!
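
To make the unknown vs. vendor-specific distinction concrete from the
consumer side, here is a rough user-space sketch of how a uevent listener
might pick an action from the WEDGED= property. It is only an illustration:
the enum, the function name wedge_pick_action() and the "vendor-specific"
string are hypothetical (the latter is just the value proposed in this
thread, not something that exists upstream), and it assumes methods are
listed least-intrusive first so that the first recognised one wins.

```c
#include <stddef.h>
#include <string.h>

/* Actions a hypothetical uevent consumer might take; names are illustrative. */
enum wedge_action {
	ACT_IGNORE,       /* not a wedged event at all */
	ACT_COLLECT_ONLY, /* "none": optional telemetry collection */
	ACT_REBIND,       /* "rebind": unbind + bind the driver */
	ACT_BUS_RESET,    /* "bus-reset": unbind + bus reset + bind */
	ACT_VENDOR_DOCS,  /* "vendor-specific": proposed here, NOT upstream */
	ACT_CHECK_LOGS,   /* "unknown" or anything unrecognised */
};

/* Compare a token of length n (not NUL-terminated) against a C string. */
static int tok_is(const char *tok, size_t n, const char *name)
{
	return strlen(name) == n && memcmp(tok, name, n) == 0;
}

/*
 * Pick an action from a "WEDGED=<m1>,<m2>,..." property string.
 * The first recognised method wins (assumption: least intrusive first).
 */
static enum wedge_action wedge_pick_action(const char *prop)
{
	static const char key[] = "WEDGED=";
	const char *p;

	if (!prop || strncmp(prop, key, sizeof(key) - 1) != 0)
		return ACT_IGNORE;

	p = prop + sizeof(key) - 1;
	while (*p) {
		size_t n = strcspn(p, ",");

		if (tok_is(p, n, "none"))
			return ACT_COLLECT_ONLY;
		if (tok_is(p, n, "rebind"))
			return ACT_REBIND;
		if (tok_is(p, n, "bus-reset"))
			return ACT_BUS_RESET;
		if (tok_is(p, n, "vendor-specific"))
			return ACT_VENDOR_DOCS;

		p += n;
		if (*p == ',')
			p++;
	}

	/* "unknown" (or garbage): nothing to automate, point at the logs. */
	return ACT_CHECK_LOGS;
}
```

With this shape, "unknown" stays the "go read the logs" fallback, while a
vendor-specific value gives the consumer a hook to look up the per-device
procedure documented by the vendor.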

> Thanks
> Riana
> 
>>
>> Raag
> 
>
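
For reference, the "unknown" fallback mentioned earlier in the thread (it
lives in drm_dev_wedged_event rather than in the method-name table) can be
modelled in plain user-space C roughly as below. This is a simplified sketch
written from my reading of drm_drv.c, not the kernel code itself: the macro
and function names are made up for the example, and the bit values are
assumed to match the upstream recovery flags.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed to mirror the upstream recovery flags; names are illustrative. */
#define WEDGE_RECOVERY_NONE      (1UL << 0) /* optional telemetry collection */
#define WEDGE_RECOVERY_REBIND    (1UL << 1) /* unbind + bind driver */
#define WEDGE_RECOVERY_BUS_RESET (1UL << 2) /* unbind + bus reset + bind */

/* Map a single bit position to its method name, or NULL if not known. */
static const char *wedge_recovery_name(unsigned int bit)
{
	switch (1UL << bit) {
	case WEDGE_RECOVERY_NONE:
		return "none";
	case WEDGE_RECOVERY_REBIND:
		return "rebind";
	case WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	default:
		return NULL;
	}
}

/*
 * Build "WEDGED=<m1>,<m2>,..." into buf and return buf.
 * No bits set, or any bit without a name, yields "WEDGED=unknown".
 */
static char *wedge_event_string(unsigned long method, char *buf, size_t size)
{
	size_t len = (size_t)snprintf(buf, size, "WEDGED=");
	unsigned int bit;
	int found = 0;

	for (bit = 0; bit < 8 * sizeof(method); bit++) {
		const char *name;

		if (!(method & (1UL << bit)))
			continue;

		name = wedge_recovery_name(bit);
		if (!name) {
			/* Invalid method: give up and report "unknown". */
			found = 0;
			break;
		}

		if (len < size)
			len += (size_t)snprintf(buf + len, size - len, "%s%s",
						found ? "," : "", name);
		found = 1;
	}

	if (!found)
		snprintf(buf, size, "WEDGED=unknown");

	return buf;
}
```

So a caller that passes no recognised method ends up with WEDGED=unknown
without "unknown" ever appearing as a named method, which is why it is easy
to miss when looking only at the name table.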