[PATCH v2 4/4] drm/xe/vf: Defer fixups if migrated twice fast

Thu Sep 26 21:48:26 UTC 2024

On 26.09.2024 16:35, Michal Wajdeczko wrote:
>
> On 24.09.2024 22:25, Tomasz Lis wrote:
>> If another VF migration happened during post-migration recovery,
>> then the current worker should be finished to allow the next
>> one start swiftly and cleanly.
>>
>> Check for defer in two places: before fixups, and before
>> sending RESFIX_DONE.
>>
>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_sriov_vf.c | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index fe5eefa736c8..f326e507d73e 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -52,6 +52,19 @@ static int vf_post_migration_requery_guc(struct xe_device *xe)
>>   	return err;
>>   }
>>   
>> +/*
>> + * vf_post_migration_imminent - Check if post-restore recovery is coming.
>> + * @xe: the &xe_device struct instance
>> + *
>> + * Return: True if migration recovery worker will soon be running. Any worker currently
>> + * executing does not affect the result.
>> + */
>> +static bool vf_post_migration_imminent(struct xe_device *xe)
>> +{
>> +	return xe->sriov.vf.migration.gt_flags != 0 ||
>> +	work_pending(&xe->sriov.vf.migration.worker);
> make sure scripts/checkpatch.pl --strict is not complaining
ok
>
>> +}
>> +
>>   /*
>>    * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
>>    * @xe: the &xe_device struct instance
>> @@ -63,11 +76,17 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>>   	int err, num_sent = 0;
>>   
>>   	for_each_gt(gt, xe, id) {
>> +		if (vf_post_migration_imminent(xe))
>> +			goto skip;
> hmm, what if new migration happen right here? this is still racy and
> likely needs to be solved at GUC-VF protocol level, not by adding more
> check points in the driver

This is how the current spec works.

"At the end of patching, check if a new MIGRATED interrupt has been 
received"

We cannot guarantee no issues if the 2nd migration happens in this short 
period. It would be a really bad luck if someone actually ran into the 
issue where one of GuCs slips through the checks here, and that would 
cause engine hang. Even then, everything would continue after skipping 
the one job.

For the check points, there are exactly 2: before fixups and before 
RESFIX_DONE. we're not adding more.

-Tomasz

>
>>   		err = xe_gt_sriov_vf_notify_resfix_done(gt);
>>   		if (!err)
>>   			num_sent++;
>>   	}
>>   	drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);
>> +	return;
>> +
>> +skip:
>> +	drm_dbg(&xe->drm, "another recovery imminent, skipping notifications\n");
>>   }
>>   
>>   static void vf_post_migration_recovery(struct xe_device *xe)
>> @@ -77,6 +96,8 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>>   	drm_dbg(&xe->drm, "migration recovery in progress\n");
>>   	xe_pm_runtime_get(xe);
>>   	err = vf_post_migration_requery_guc(xe);
>> +	if (vf_post_migration_imminent(xe))
>> +		goto defer;
>>   	if (unlikely(err))
>>   		goto fail;
>>   
>> @@ -85,6 +106,10 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>>   	xe_pm_runtime_put(xe);
>>   	drm_notice(&xe->drm, "migration recovery ended\n");
>>   	return;
>> +defer:
>> +	xe_pm_runtime_put(xe);
>> +	drm_dbg(&xe->drm, "migration recovery deferred\n");
>> +	return;
>>   fail:
>>   	xe_pm_runtime_put(xe);
>>   	drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));