[PATCH 2/2] drm/xe/vf: Don't update GuC reset policy when changing wedged mode

Fri Apr 4 10:03:56 UTC 2025

On 4/3/2025 18:49, Michal Wajdeczko wrote:
>
> On 03.04.2025 17:17, Laguna, Lukasz wrote:
>> On 4/3/2025 13:05, Michal Wajdeczko wrote:
>>> On 03.04.2025 11:41, Lukasz Laguna wrote:
>>>> Prevent the VF from attempting to update the GuC reset policy when
>>>> changing the wedged mode, as this operation is not supported for VFs.
>>>>
>>>> Log a message to indicate that GuC may still cause engine reset even
>>>> with wedged_mode=2.
>>>>
>>>> Signed-off-by: Lukasz Laguna <lukasz.laguna at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_debugfs.c | 5 +++++
>>>>    1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/
>>>> xe_debugfs.c
>>>> index d0503959a8ed..062668d02365 100644
>>>> --- a/drivers/gpu/drm/xe/xe_debugfs.c
>>>> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
>>>> @@ -171,6 +171,11 @@ static ssize_t wedged_mode_set(struct file *f,
>>>> const char __user *ubuf,
>>>>          xe->wedged.mode = wedged_mode;
>>>>    +    if (IS_SRIOV_VF(xe)) {
>>>> +        drm_info_once(&xe->drm, "VF can't change GuC's engine reset
>>>> policy. GuC may still cause engine reset even with wedged_mode=2\n");
>>> never use drm_info_once() logs for places where multiple different
>>> devices can be reached, as then doing something similar on next device
>>> there will be no trace at all
>> ok
>>
>>> also should we change xe->wedged.mode if it is N/A for VF?
>> We can't change engine reset policy, but I wouldn't say that wedged.mode
>> is N/A for VFs.
>>
>> We can still disable it (mode=0), or use mode=2 to easily wedge device
>> with simple exec timeout in order to e.g. validate whether driver
>> behavior in case of wedged device is correct (all IOCTLs are blocked
>> until rebind).
> I was referring to the code where first there is an assignment of the
> new wedged mode and then we abort and eventually print message that it
> wasn't applied - IMO this is wrong

Depending how we look on it. Wedged mode was changed and driver behavior 
will be different. Now device will be declared wedged on every GPU hang.

But if we decide that in wedged mode=2, engine resets MUST be disabled 
then I agree that wedge.mode should not be updated in case of an error 
of reset policy change.

Next thing is that we shouldn't change wedge.mode and return error at 
the same time. If disabled engine resets in mode 2 are not a must, then 
we should return success.

BTW, I noticed some inconsistency. Looks like if we set wedged mode = 2 
using module parameter we don't disable engine resets.

>>> btw, what is our approach if someone on the PF already set some policy
>>> on GuC will not do engine reset? or it is n/a after a VF switch?
>> It's global policy - engine resets will be disabled for PF and VFs, and
>> that's something I would expect.
> my question is whether ordinary VF user expects that such policy could
> be set differently on different setups (or changed at any time on PF)
>
> maybe for the production PF case we shouldn't allow changing PF wedge
> mode (something like we do for EUDBG or CCS mode case in PF w/VFs)

We can change it on the fly only via debugfs. In production it's 
possible only to set it using unsafed_param. Should we block creating 
VFs for production PF in case of wedged_mode=2?

>>>> +        return size;
>>> shouldn't we return -EPERM instead?
>> I returned size on purpose, as it's expected that VF is not able to
>> change reset policy.
> hmm, but by returning 'size' you're saying "OK"
> not that VF can't change it
>
> when read back, there will new wedge mode visible
> (well, it won't work as expected, though)

Based on what I shared above, for me it was "OK", but I can be wrong.
My POV: With current implementation we can successfully set wedged mode 
to "upon-any-hang" on VF, disable engine resets on PF if needed and 
check devcoredump on VF for debug.
Do you think we should make wedge_mode=2 as N/A for VF?

>> Still, informed user that GuC may cause engine
>> resets even in mode=2. I can add extra info, that if needed, engine
>> resets needs to be disabled by PF.
> I'm not sure we need detailed messages for that in dmesg
> good release notes about 'wedge_mode' should be sufficient
>
>>>> +    }
>>>> +
>>>>        xe_pm_runtime_get(xe);
>>>>        for_each_gt(gt, xe, id) {
>>>>            ret = xe_guc_ads_scheduler_policy_toggle_reset(&gt-
>>>>> uc.guc.ads);