[PATCH 2/2] drm/xe/vf: Retry sending MMIO request to GUC on timeout error

Thu Feb 20 11:53:03 UTC 2025

Satyanarayana K V P <satyanarayana.k.v.p at intel.com> wrote on Thu [2025-Feb-20 12:11:19 +0530]:
> Add support to allow retrying the sending of MMIO requests
> from the VF to the GUC in the event of an error. During the
> suspend/resume process, VFs begin resuming only after the PF has
> resumed. Although the PF resumes, the GUC reset and provisioning
> occur later in a separate worker process.
> 
> When there are a large number of VFs, some may attempt to resume
> before the PF has completed its provisioning. Therefore, if a
> MMIO request from a VF fails during this period, we will retry
> sending the request up to GUC_RESET_VF_STATE_RETRY_MAX times,
> which is set to a maximum of 10 attempts.
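
For reference, the retry policy described above can be sketched outside the driver as below. This is only an illustration and not the driver code: fake_vf_reset(), reset_with_retry(), calls_until_ready and attempts are hypothetical names standing in for guc_action_vf_reset() and its caller, and RESET_RETRY_MAX simply mirrors GUC_RESET_VF_STATE_RETRY_MAX from the patch.

```c
#include <assert.h>
#include <errno.h>

#define RESET_RETRY_MAX	10	/* mirrors GUC_RESET_VF_STATE_RETRY_MAX */

/* Hypothetical test state: how many calls time out before success. */
static int calls_until_ready;
static int attempts;

/*
 * Hypothetical stand-in for guc_action_vf_reset(): returns -ETIMEDOUT
 * while the simulated PF provisioning is pending, then 0.
 */
static int fake_vf_reset(void)
{
	attempts++;
	if (calls_until_ready > 0) {
		calls_until_ready--;
		return -ETIMEDOUT;
	}
	return 0;
}

/*
 * Call action() up to RESET_RETRY_MAX times. Only -ETIMEDOUT triggers
 * another attempt; success or any other error ends the loop at once.
 */
static int reset_with_retry(int (*action)(void))
{
	unsigned int retry = RESET_RETRY_MAX;
	int err;

	assert(action);

	do {
		err = action();
		if (err != -ETIMEDOUT)
			break;
	} while (--retry);

	return err;
}
```

With calls_until_ready set to 3, reset_with_retry(fake_vf_reset) returns 0 on the fourth call. If the stub never becomes ready, the helper stops after 10 calls and returns -ETIMEDOUT.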

Maybe I'm wrong, but shouldn't the previous patch already prevent this?
As I understand it, if the PF and VF are on the same host, the previous patch keeps the VF
from resuming until the PF has finished resuming.
If the VF is passed through to a VM, I don't think there is a problem either, because
userspace (and so the VM) will not start resuming until the host kernel is ready.

So it seems to me that we should never end up in a situation where the VF sends the reset
action before the PF has sent the config to GuC.

BTW: The title of the cover letter is a bit misleading because it only mentions the PF and VF link.

Thanks,
Piotr

> 
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> Cc: Michał Wajdeczko <michal.wajdeczko at intel.com>
> Cc: Michał Winiarski <michal.winiarski at intel.com>
> Cc: Piotr Piórkowski <piotr.piorkowski at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> index 4831549da319..a439261bf4d7 100644
> --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> @@ -47,12 +47,19 @@ static int guc_action_vf_reset(struct xe_guc *guc)
>  	return ret > 0 ? -EPROTO : ret;
>  }
>  
> +#define GUC_RESET_VF_STATE_RETRY_MAX	10
>  static int vf_reset_guc_state(struct xe_gt *gt)
>  {
> +	unsigned int retry = GUC_RESET_VF_STATE_RETRY_MAX;
>  	struct xe_guc *guc = &gt->uc.guc;
>  	int err;
>  
> -	err = guc_action_vf_reset(guc);
> +	do {
> +		err = guc_action_vf_reset(guc);
> +		if (!err || err != -ETIMEDOUT)
> +			break;
> +	} while (--retry);
> +
>  	if (unlikely(err))
>  		xe_gt_sriov_err(gt, "Failed to reset GuC state (%pe)\n", ERR_PTR(err));
>  	return err;
> -- 
> 2.35.3
> 
