[PATCH v3 4/7] drm/xe: Block reset while recovering from VF migration

Michał Winiarski michal.winiarski at intel.com
Wed May 28 20:02:25 UTC 2025


On Tue, May 20, 2025 at 01:19:22AM +0200, Tomasz Lis wrote:
> Resetting GuC during recovery could interfere with the recovery
> process. Such a reset might also be triggered without justification,
> merely because migration takes time, rather than because a workload
> is not progressing.
> 
> A GuC reset during the recovery would cause an exit from the RESFIX
> state, and therefore let GuC continue its work while fixups are still
> being applied. To avoid that, resets need to be blocked for the
> duration of the recovery.
> 
> This patch blocks resets during recovery. Any reset request issued
> within that time range will be dropped.
> 
> If a reset procedure has already started when the recovery is
> triggered, there isn't much we can do - we cannot wait for it to
> finish, as that involves waiting for hardware, and we can't be sure
> at which exact point of the reset procedure the GPU got switched.
> Therefore, the rare cases where migration happens while a reset is
> in progress remain dangerous. Resets are not part of the standard
> flow, and they leave workloads unfinished - that will also happen
> when a reset is interrupted by migration, so this doesn't diverge
> much from what normally happens during such resets.
> 
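Just to confirm I'm reading the mechanism right: the "drop" works
because xe_guc_submit_reset_block() returns the previous value of
guc->submission_state.stopped, so the reset path can bail out when it
finds submission already stopped. Roughly like this (illustrative only,
not the actual call site):

	ret = xe_guc_submit_reset_prepare(guc);
	if (ret) {
		/* already stopped, e.g. by migration recovery - drop this reset */
		return 0;
	}
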
> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 26 ++++++++++++++++++++++++--
>  drivers/gpu/drm/xe/xe_guc_submit.h |  2 ++
>  drivers/gpu/drm/xe/xe_sriov_vf.c   | 12 ++++++++++--
>  3 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 6f280333de13..69ccfb2e1cff 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1761,7 +1761,11 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>  	}
>  }
>  
> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +/**
> + * xe_guc_submit_reset_block - Disallow reset calls on given GuC.
> + * @guc: the &xe_guc struct instance
> + */
> +int xe_guc_submit_reset_block(struct xe_guc *guc)
>  {
>  	int ret;
>  
> @@ -1774,6 +1778,24 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>  	 */
>  	ret = atomic_fetch_or(1, &guc->submission_state.stopped);
>  	smp_wmb();
> +
> +	return ret;
> +}
> +
> +/**
> + * xe_guc_submit_reset_unblock - Allow reset calls on given GuC again.
> + * @guc: the &xe_guc struct instance
> + */
> +void xe_guc_submit_reset_unblock(struct xe_guc *guc)
> +{
> +	atomic_dec(&guc->submission_state.stopped);
> +}
> +
> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> +{
> +	int ret;
> +
> +	ret = xe_guc_submit_reset_block(guc);
>  	wake_up_all(&guc->ct.wq);
>  
>  	return ret;
> @@ -1849,7 +1871,7 @@ int xe_guc_submit_start(struct xe_guc *guc)
>  	xe_gt_assert(guc_to_gt(guc), xe_guc_read_stopped(guc) == 1);
>  
>  	mutex_lock(&guc->submission_state.lock);
> -	atomic_dec(&guc->submission_state.stopped);
> +	xe_guc_submit_reset_unblock(guc);
>  	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
>  		/* Prevent redundant attempts to start parallel queues */
>  		if (q->guc->id != index)
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> index f1cf271492ae..2c2d2936440d 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> @@ -20,6 +20,8 @@ void xe_guc_submit_stop(struct xe_guc *guc);
>  int xe_guc_submit_start(struct xe_guc *guc);
>  void xe_guc_submit_pause(struct xe_guc *guc);
>  void xe_guc_submit_unpause(struct xe_guc *guc);
> +int xe_guc_submit_reset_block(struct xe_guc *guc);
> +void xe_guc_submit_reset_unblock(struct xe_guc *guc);
>  void xe_guc_submit_wedge(struct xe_guc *guc);
>  
>  int xe_guc_read_stopped(struct xe_guc *guc);
> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
> index fcd82a0fda48..82b3dd57de73 100644
> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
> @@ -150,9 +150,15 @@ static void vf_post_migration_shutdown(struct xe_device *xe)
>  {
>  	struct xe_gt *gt;
>  	unsigned int id;
> +	int ret = 0;
>  
> -	for_each_gt(gt, xe, id)
> +	for_each_gt(gt, xe, id) {
>  		xe_guc_submit_pause(&gt->uc.guc);
> +		ret |= xe_guc_submit_reset_block(&gt->uc.guc);
> +	}
> +
> +	if (ret)
> +		drm_info(&xe->drm, "migration recovery encountered ongoing reset\n");

I'd suggest debug level, as it doesn't seem all that useful in general.
But I guess you want to have a trace that this happened, since the
handling of this scenario is somewhat iffy...
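
Something like this is what I have in mind - same message, just demoted
to debug level (untested, drm_dbg() is just the obvious candidate):

	drm_dbg(&xe->drm, "migration recovery encountered ongoing reset\n");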

>  }
>  
>  /**
> @@ -170,8 +176,10 @@ static void vf_post_migration_kickstart(struct xe_device *xe)
>  	/* make sure interrupts on the new HW are properly set */
>  	xe_irq_resume(xe);
>  
> -	for_each_gt(gt, xe, id)
> +	for_each_gt(gt, xe, id) {
> +		xe_guc_submit_reset_unblock(&gt->uc.guc);

If we were doing migration recovery during a reset, we'd change the
state of guc->submission_state.stopped (btw... is it possible to
decrement it below 0 now?).
There are "some" places where we have asserts that expect a particular
state, and some places where we use that state as an event (combined
with a waitqueue). Should we wake up the waiters at some point? And
perhaps synchronize with the places which depend on being in a
"stopped" state (if at all possible)?
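
For the underflow, something like atomic_dec_if_positive() could keep
the counter from going below 0, and a wake_up_all() would mirror what
xe_guc_submit_reset_prepare() already does on the other side. Untested
sketch, just to illustrate the idea:

void xe_guc_submit_reset_unblock(struct xe_guc *guc)
{
	/* don't drop below 0 if an unblock races with a completed reset */
	atomic_dec_if_positive(&guc->submission_state.stopped);
	/* let anyone waiting on the stopped state re-check it */
	wake_up_all(&guc->ct.wq);
}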

Thanks,
-Michał

>  		xe_guc_submit_unpause(&gt->uc.guc);
> +	}
>  }
>  
>  /**
> -- 
> 2.25.1
> 

