[PATCH v1] drm/xe/vf: Fail migration recovery if fixups needed but platform not supported

Tue May 13 23:16:04 UTC 2025

On 13.05.2025 13:21, Michal Wajdeczko wrote:
>
> On 13.05.2025 01:06, Tomasz Lis wrote:
>> The post-migration recovery needs to be fully implemented for a
>> specific platform in order to make continuation of workloads
>> possible.
>>
>> New platforms introduce changes which affect the recovery procedure,
>> and without a clear verification of support this leads to errors
>> with no straight forward error message explaining the cause.
>>
>> This patch fixes that issue - it introduces a message to be logged
>> when the current driver is known to not support the current platform.
>>
>> Wedging the driver immediately also decreases the amount of
>> additional errors which would come afterwards if the driver continued
>> operation.
>>
>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_sriov_vf.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index 2674fa948fda..f21f98f5d25f 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -224,6 +224,11 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>>   	drm_dbg(&xe->drm, "another recovery imminent, skipping notifications\n");
>>   }
>>   
>> +static bool fixups_supported(struct xe_device *xe)
>> +{
> can we have some TODO comment here explaining what conditions we expect
> to be added here? or maybe we can start with CONFIG_XE_DEBUG to indicate
> early development phase?
do you mean `return IS_ENABLED(CONFIG_DRM_XE_DEBUG_SRIOV);`?
>> +	return false;
>> +}
>> +
>>   static void vf_post_migration_recovery(struct xe_device *xe)
>>   {
>>   	bool need_fixups;
>> @@ -243,6 +248,11 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>>   		vf_post_migration_fixup_ctb(xe);
>>   
>>   	vf_post_migration_notify_resfix_done(xe);
>> +	if (need_fixups && !fixups_supported(xe)) {
>> +		drm_err(&xe->drm, "migration recovery not supported by this module version\n");
> we already have drm_err in the fail: section, do we need this extra one?
The idea behind introducing this error is to inform the user 
specifically that the module version used has no support of VF 
migration. We do need it, it's the point.
> if yes, can we make the message more specific
What specific information do you think should be conveyed there?
>   (and maybe the reason
> should be printed in fixups_supported() as for now it's all magic)

I was planning that function as a conditions check function, not 
decision function. Such functions typically just do the checks and 
return the result.

What's the concept you have?

> also, since support likely will not change between one migration and the
> other, maybe it should be just a single drm_info() message printed
> during a VF boot that any later migration will fail, without waiting
> until the first migration happen to surprise the user

The idea is to make sure user sees the message after fail - and he will 
first search for error message.

We've settled on providing the information when the issue happened; we 
can re-evaluate that if you think it's not the right call.

>> +		err = -ENOTRECOVERABLE;
>> +		goto fail;
>> +	}
> hmm, and this whole chunk seems to be placed in a wrong place - if
> fixups are not supported, why did we attempt to fixup CTB few lines
> above

My idea was to allow testing the existing recovery even if it's incomplete.

But it doesn't matter to me - can change if no fixups at all is better 
for whatever reason.

>   and claim that fixups are done? can you please explain

we never claim that to the user. We only take GuC out of RESFIX state.

As stated above, I have no problem with shifting the verification to 
earlier, it does not matter.

-Tomasz

>>   	xe_pm_runtime_put(xe);
>>   	drm_notice(&xe->drm, "migration recovery ended\n");
>>   	return;