[PATCH v3] drm/xe/vf: Fail migration recovery if fixups needed but platform not supported

Mon May 19 20:23:38 UTC 2025


On 15.05.2025 13:12, Tomasz Lis wrote:
> The post-migration recovery needs to be fully implemented for a
> specific platform in order to make continuation of workloads
> possible.
> 
> New platforms introduce changes which affect the recovery procedure,
> and without a clear verification of support this leads to errors
> with no straightforward error message explaining the cause.
> 
> This patch fixes that issue - it introduces a message to be logged
> when the current driver is known to not support the current platform.
> 
> Wedging the driver immediately also decreases the number of
> additional errors which would come afterwards if the driver continued
> operation.
> 
> v2: Show the message during probe as well as during recovery; do not
>   perform any recovery steps if the recovery is bound to fail
> v3: Use SRIOV-specific logging, fix typos
> 
> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Michał Winiarski <michal.winiarski at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_sriov_vf.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index 2674fa948fda..b578d171eb83 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -123,6 +123,15 @@
>   *      |                               |                               |
>   */
>  
> +static bool vf_migration_supported(struct xe_device *xe)
> +{
> +	/*
> +	 * TODO: Add conditions to allow specific platforms, when they're
> +	 * supported at production quality.
> +	 */
> +	return IS_ENABLED(CONFIG_DRM_XE_DEBUG_SRIOV);

If we want this feature to be at least lightly tested by our CI, then
this should be CONFIG_DRM_XE_DEBUG instead. With that fixed or clarified:

	Reviewed-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
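
For illustration only, here is a rough userspace sketch of the gating
pattern discussed above: check support first, log one clear message, and
fail before any fixup step runs. The names fake_device, debug_build,
migration_supported() and post_migration_recovery() are hypothetical
stand-ins, not the driver's real types or functions, and the boolean
field only models what IS_ENABLED(CONFIG_DRM_XE_DEBUG) would decide at
build time. Marking the device wedged in the failure path is an
assumption based on the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct xe_device; not the real driver type. */
struct fake_device {
	bool debug_build;     /* models IS_ENABLED(CONFIG_DRM_XE_DEBUG) */
	bool wedged;          /* set when recovery is abandoned */
	int fixup_steps_run;  /* counts recovery steps actually executed */
};

/* Models vf_migration_supported(): only debug builds opt in for now. */
bool migration_supported(const struct fake_device *dev)
{
	return dev->debug_build;
}

/*
 * Models the early exit in the recovery path: if the platform is not
 * supported, report it once and fail before touching any state.
 */
int post_migration_recovery(struct fake_device *dev)
{
	if (!migration_supported(dev)) {
		fprintf(stderr, "migration not supported by this module version\n");
		dev->wedged = true;  /* assumed effect of the fail path */
		return -ENOTRECOVERABLE;
	}

	dev->fixup_steps_run++;  /* placeholder for the real fixups */
	return 0;
}
```

With debug_build set to false the sketch returns -ENOTRECOVERABLE and
runs no fixup step; with it set to true the placeholder step runs once.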

> +}
> +
>  static void migration_worker_func(struct work_struct *w);
>  
>  /**
> @@ -132,6 +141,9 @@ static void migration_worker_func(struct work_struct *w);
>  void xe_sriov_vf_init_early(struct xe_device *xe)
>  {
>  	INIT_WORK(&xe->sriov.vf.migration.worker, migration_worker_func);
> +
> +	if (!vf_migration_supported(xe))
> +		xe_sriov_info(xe, "migration not supported by this module version\n");
>  }
>  
>  /**
> @@ -236,6 +248,11 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>  		goto defer;
>  	if (unlikely(err))
>  		goto fail;
> +	if (!vf_migration_supported(xe)) {
> +		xe_sriov_err(xe, "migration not supported by this module version\n");
> +		err = -ENOTRECOVERABLE;
> +		goto fail;
> +	}
>  
>  	need_fixups = vf_post_migration_fixup_ggtt_nodes(xe);
>  	/* FIXME: add the recovery steps */