[PATCH 2/2] drm/xe/vf: Retry sending MMIO request to GUC on timeout error
K V P, Satyanarayana
satyanarayana.k.v.p at intel.com
Thu Feb 20 15:18:47 UTC 2025
> -----Original Message-----
> From: Piorkowski, Piotr <piotr.piorkowski at intel.com>
> Sent: Thursday, February 20, 2025 5:23 PM
> To: K V P, Satyanarayana <satyanarayana.k.v.p at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Wajdeczko, Michal
> <Michal.Wajdeczko at intel.com>; Winiarski, Michal
> <michal.winiarski at intel.com>
> Subject: Re: [PATCH 2/2] drm/xe/vf: Retry sending MMIO request to GUC on
> timeout error
>
> Satyanarayana K V P <satyanarayana.k.v.p at intel.com> wrote on Thu
> [2025-Feb-20 12:11:19 +0530]:
> > Add support to retry sending MMIO requests from the VF to the GuC
> > in the event of an error. During the suspend/resume process, VFs
> > begin resuming only after the PF has resumed. Although the PF
> > resumes, the GuC reset and provisioning occur later in a separate
> > worker.
> >
> > When there are a large number of VFs, some may attempt to resume
> > before the PF has completed its provisioning. Therefore, if an
> > MMIO request from a VF fails during this period, retry sending
> > the request up to GUC_RESET_VF_STATE_RETRY_MAX (10) times.
>
> Maybe I'm wrong, but shouldn't the previous patch have prevented this?
> I understand that if the PF and VF are on the same host, that previous
> patch will cause the VF to not start resuming until the PF has finished
> resuming. If the VF is passed through to a VM, then I don't think there
> should be a problem, because userspace (and the VM) will not start
> resuming until the kernel on the host is ready.
>
> So it seems to me that a situation should not arise here where the VF
> sends the reset actions and the config has not yet been sent by the PF
> to the GuC.
>
> BTW: The title of the cover letter is a bit misleading because it only
> mentions the PF and VF link.
>
> Thanks,
> Piotr
>
As mentioned in the commit message, the VF reset and provisioning happen
in a worker. Once PF resumption completes, the VFs start to resume. With
a larger number of VFs (say 16), all 16 VFs try to resume together, but
provisioning happens for one VF at a time, so higher-numbered VFs start
resuming before they have been provisioned.
This scenario applies only when the VFs are probed on the host; it is
not an issue when a VF is passed through to a VM.
Creating the link between the PF and the VF is the main feature, and
this patch fixes an issue that arises because of it. That is why the
cover letter is titled "Create a link between PF and VF".
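
For clarity, the retry pattern boils down to the following (a simplified
sketch of the hunk quoted below, not the exact diff; the assumption is
that guc_action_vf_reset() returns -ETIMEDOUT while the GuC is not yet
ready to serve this VF):

#define GUC_RESET_VF_STATE_RETRY_MAX	10

static int vf_reset_guc_state(struct xe_gt *gt)
{
	unsigned int retry = GUC_RESET_VF_STATE_RETRY_MAX;
	struct xe_guc *guc = &gt->uc.guc;
	int err;

	do {
		err = guc_action_vf_reset(guc);
		/* Retry only on timeout; stop on success or any other error. */
		if (err != -ETIMEDOUT)
			break;
	} while (--retry);

	if (unlikely(err))
		xe_gt_sriov_err(gt, "Failed to reset GuC state (%pe)\n",
				ERR_PTR(err));
	return err;
}

(Checking err != -ETIMEDOUT alone already covers the success case, so
the extra !err test in the posted diff is redundant but harmless.)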
> >
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> > Cc: Michał Wajdeczko <michal.wajdeczko at intel.com>
> > Cc: Michał Winiarski <michal.winiarski at intel.com>
> > Cc: Piotr Piórkowski <piotr.piorkowski at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 4831549da319..a439261bf4d7 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -47,12 +47,19 @@ static int guc_action_vf_reset(struct xe_guc *guc)
> > return ret > 0 ? -EPROTO : ret;
> > }
> >
> > +#define GUC_RESET_VF_STATE_RETRY_MAX 10
> > static int vf_reset_guc_state(struct xe_gt *gt)
> > {
> > + unsigned int retry = GUC_RESET_VF_STATE_RETRY_MAX;
> > struct xe_guc *guc = &gt->uc.guc;
> > int err;
> >
> > - err = guc_action_vf_reset(guc);
> > + do {
> > + err = guc_action_vf_reset(guc);
> > + if (!err || err != -ETIMEDOUT)
> > + break;
> > + } while (--retry);
> > +
> > if (unlikely(err))
> > xe_gt_sriov_err(gt, "Failed to reset GuC state (%pe)\n", ERR_PTR(err));
> > return err;
> > --
> > 2.35.3
> >
>
> --