[PATCH 3/3] drm/xe/pf: Don't resume device from restart worker

Fri Aug 1 20:45:53 UTC 2025

On Wed, Jul 30, 2025 at 11:48:01PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 7/30/2025 11:10 PM, Cavitt, Jonathan wrote:
> > -----Original Message-----
> > From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of Michal Wajdeczko
> > Sent: Wednesday, July 30, 2025 10:49 AM
> > To: intel-xe at lists.freedesktop.org
> > Cc: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> > Subject: [PATCH 3/3] drm/xe/pf: Don't resume device from restart worker
> >>
> >> The PF's restart worker shouldn't attempt to resume the device on
> >> its own, since its goal is to finish PF and VFs reprovisioning on
> >> the recently reset GuC. Take extra RPM reference while scheduling
> >> a work and release it from the worker or when we cancel a work.
> >>
> >> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
> >> ---
> >>  drivers/gpu/drm/xe/xe_gt_sriov_pf.c | 23 +++++++++++++++++++----
> >>  1 file changed, 19 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_pf.c b/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
> >> index 8bc7d7f9f47a..0c9012fb625d 100644
> >> --- a/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
> >> +++ b/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
> >> @@ -53,7 +53,11 @@ static void pf_init_workers(struct xe_gt *gt)
> >>  
> >>  static void pf_fini_workers(struct xe_gt *gt)
> >>  {
> >> -	disable_work_sync(&gt->sriov.pf.workers.restart);
> >> +	if (disable_work_sync(&gt->sriov.pf.workers.restart)) {
> >> +		xe_gt_sriov_dbg_verbose(gt, "pending restart disabled!\n");
> >> +		/* release a rpm reference taken on the worker behalf */
> >> +		xe_pm_runtime_put(gt_to_xe(gt));
> >> +	}
> >>  }
> >>  
> >>  /**
> >> @@ -205,8 +209,11 @@ static void pf_cancel_restart(struct xe_gt *gt)
> >>  {
> >>  	xe_gt_assert(gt, IS_SRIOV_PF(gt_to_xe(gt)));
> >>  
> >> -	if (cancel_work_sync(&gt->sriov.pf.workers.restart))
> >> +	if (cancel_work_sync(&gt->sriov.pf.workers.restart)) {
> >>  		xe_gt_sriov_dbg_verbose(gt, "pending restart canceled!\n");
> >> +		/* release a rpm reference taken on the worker behalf */
> >> +		xe_pm_runtime_put(gt_to_xe(gt));
> >> +	}
> >>  }
> >>  
> >>  /**
> >> @@ -224,9 +231,12 @@ static void pf_restart(struct xe_gt *gt)
> >>  {
> >>  	struct xe_device *xe = gt_to_xe(gt);
> >>  
> >> -	xe_pm_runtime_get(xe);
> >> +	xe_gt_assert(gt, !xe_pm_runtime_suspended(xe));
> >> +
> >>  	xe_gt_sriov_pf_config_restart(gt);
> >>  	xe_gt_sriov_pf_control_restart(gt);
> >> +
> >> +	/* release a rpm reference taken on our behalf */
> > 
> > NIT:
> > For consistency with the other two comments, maybe:
> > s/our/the worker
> > Or is the pm reference taken in this instance different from the pm reference
> > taken in pf_cancel_restart and pf_fini_workers?
> 
> this is the worker context, hence 'our'

I honestly prefer the non personal statements. But not an issue

> 
> > 
> > There're also some other minor grammar things ("s/a rpm/an rpm" and 
> > "s/worker behalf/worker's behalf", for example) that can be applied more
> > generally to the whole patch.
> > 
> > I'm not going to block on minor grammatical fixups, though, so:

My english is broken so I usually miss things like this. I even just learned
that the 'an rpm' is the right one...

But anyway, if you notice grammatical issues, please refrain on giving
your reviewed-by. Otherwise we will need to keep accepting small patches
later with small grammar fixes :/

Please ask the author a new revision and only put your rv-b when you are
comfortable that we are not leaving things behind.

> > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > -Jonathan Cavitt
> > 
> >>  	xe_pm_runtime_put(xe);
> >>  
> >>  	xe_gt_sriov_dbg(gt, "restart completed\n");
> >> @@ -245,8 +255,13 @@ static void pf_queue_restart(struct xe_gt *gt)
> >>  
> >>  	xe_gt_assert(gt, IS_SRIOV_PF(xe));
> >>  
> >> -	if (!queue_work(xe->sriov.wq, &gt->sriov.pf.workers.restart))
> >> +	/* take a rpm reference on behalf of the worker */
> >> +	xe_pm_runtime_get_noresume(xe);
> >> +
> >> +	if (!queue_work(xe->sriov.wq, &gt->sriov.pf.workers.restart)) {
> >>  		xe_gt_sriov_dbg(gt, "restart already in queue!\n");
> >> +		xe_pm_runtime_put(xe);

As for the patch itself from the power management perspective, the approach
looks good to me. It is the right thing to do if we don't want to let the
device to suspend and possibly losing power between the scheduling and
the work is done. However the hard part is to ensure that all the work
cancellation is taken care properly, what I believe you did right.

But it is important that the reviewers checked for other possible missed
spots here. I hope this is the case.

Thanks,
Rodrigo.

> >> +	}
> >>  }
> >>  
> >>  /**
> >> -- 
> >> 2.47.1
> >>
> >>
>