[PATCH 3/3] drm/i915/hwmon: Block waiting for GuC reset to complete

Thu Apr 20 15:43:52 UTC 2023

On Thu, Apr 20, 2023 at 08:57:24AM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2023 23:10, Dixit, Ashutosh wrote:
> > On Wed, 19 Apr 2023 06:21:27 -0700, Tvrtko Ursulin wrote:
> > > 
> > 
> > Hi Tvrtko,
> > 
> > > On 10/04/2023 23:35, Ashutosh Dixit wrote:
> > > > Instead of erroring out when GuC reset is in progress, block waiting for
> > > > GuC reset to complete which is a more reasonable uapi behavior.
> > > > 
> > > > v2: Avoid race between wake_up_all and waiting for wakeup (Rodrigo)
> > > > 
> > > > Signed-off-by: Ashutosh Dixit <ashutosh.dixit at intel.com>
> > > > ---
> > > >    drivers/gpu/drm/i915/i915_hwmon.c | 38 +++++++++++++++++++++++++++----
> > > >    1 file changed, 33 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_hwmon.c b/drivers/gpu/drm/i915/i915_hwmon.c
> > > > index 9ab8971679fe3..8471a667dfc71 100644
> > > > --- a/drivers/gpu/drm/i915/i915_hwmon.c
> > > > +++ b/drivers/gpu/drm/i915/i915_hwmon.c
> > > > @@ -51,6 +51,7 @@ struct hwm_drvdata {
> > > > 	char name[12];
> > > > 	int gt_n;
> > > > 	bool reset_in_progress;
> > > > +	wait_queue_head_t waitq;
> > > >    };
> > > >      struct i915_hwmon {
> > > > @@ -395,16 +396,41 @@ hwm_power_max_read(struct hwm_drvdata *ddat, long *val)
> > > >    static int
> > > >    hwm_power_max_write(struct hwm_drvdata *ddat, long val)
> > > >    {
> > > > +#define GUC_RESET_TIMEOUT msecs_to_jiffies(2000)
> > > > +
> > > > +	int ret = 0, timeout = GUC_RESET_TIMEOUT;
> > > 
> > > Patch looks good to me
> > 
> > Great, thanks :)
> > 
> > > apart that I am not sure what is the purpose of the timeout? This is just
> > > the sysfs write path or has more callers?
> > 
> > It is just the sysfs path, but the sysfs is accessed also by the oneAPI
> > stack (Level 0). In the initial version I also didn't have the timeout
> > thinking that the app can send a signal to the blocked thread to unblock
> > it. I introduced the timeout after Rodrigo brought it up and I am now
> > thinking maybe it's better to have the timeout in the driver since the app
> > has no knowledge of how long GuC resets can take. But I can remove it if
> > you think it's not needed.
> 
> Maybe I am missing something but I don't get why we would need to provide a
> timeout facility in sysfs? If the library writes here to configure something
> it already has to expect a blocking write by the nature of a a write(2) and
> sysfs contract. It can take long for any reason so I hope we are not
> guaranteeing some latency number to someone? Or the concern is just about
> things getting stuck? In which case I think Ctrl-C is the answer because
> ETIME is not even listed as an errno for write(2).

I suggested the timeout on the other version because of that race,
which is fixed now with this approach. It is probably better to remove
it now to avoid confusions. I'm sorry about that.

> 
> > > If the
> > > former perhaps it would be better to just use interruptible everything
> > > (mutex and sleep) and wait for as long as it takes or until user presses
> > > Ctrl-C?
> > 
> > Now we are not holding the mutexes for long, just long enough do register
> > rmw's. So not holding the mutex across GuC reset as we were
> > originally. Therefore I am thinking mutex_lock_interruptible is not needed?
> > The sleep is already interruptible (TASK_INTERRUPTIBLE).
> 
> Ah yes, you are right.
> 
> Regards,
> 
> Tvrtko
> 
> > Anyway please let me know if you think we need to change anything.
> > 
> > Thanks.
> > --
> > Ashutosh
> > 
> > > > 	struct i915_hwmon *hwmon = ddat->hwmon;
> > > > 	intel_wakeref_t wakeref;
> > > > -	int ret = 0;
> > > > +	DEFINE_WAIT(wait);
> > > > 	u32 nval;
> > > >    -	mutex_lock(&hwmon->hwmon_lock);
> > > > -	if (hwmon->ddat.reset_in_progress) {
> > > > -		ret = -EAGAIN;
> > > > -		goto unlock;
> > > > +	/* Block waiting for GuC reset to complete when needed */
> > > > +	for (;;) {
> > > > +		mutex_lock(&hwmon->hwmon_lock);
> > > > +
> > > > +		prepare_to_wait(&ddat->waitq, &wait, TASK_INTERRUPTIBLE);
> > > > +
> > > > +		if (!hwmon->ddat.reset_in_progress)
> > > > +			break;
> > > > +
> > > > +		if (signal_pending(current)) {
> > > > +			ret = -EINTR;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (!timeout) {
> > > > +			ret = -ETIME;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		mutex_unlock(&hwmon->hwmon_lock);
> > > > +
> > > > +		timeout = schedule_timeout(timeout);
> > > > 	}
> > > > +	finish_wait(&ddat->waitq, &wait);
> > > > +	if (ret)
> > > > +		goto unlock;
> > > > +
> > > > 	wakeref = intel_runtime_pm_get(ddat->uncore->rpm);
> > > > 		/* Disable PL1 limit and verify, because the limit cannot be
> > > > disabled on all platforms */
> > > > @@ -508,6 +534,7 @@ void i915_hwmon_power_max_restore(struct drm_i915_private *i915, bool old)
> > > > 	intel_uncore_rmw(hwmon->ddat.uncore, hwmon->rg.pkg_rapl_limit,
> > > > 			 PKG_PWR_LIM_1_EN, old ? PKG_PWR_LIM_1_EN : 0);
> > > > 	hwmon->ddat.reset_in_progress = false;
> > > > +	wake_up_all(&hwmon->ddat.waitq);
> > > > 		mutex_unlock(&hwmon->hwmon_lock);
> > > >    }
> > > > @@ -784,6 +811,7 @@ void i915_hwmon_register(struct drm_i915_private *i915)
> > > > 	ddat->uncore = &i915->uncore;
> > > > 	snprintf(ddat->name, sizeof(ddat->name), "i915");
> > > > 	ddat->gt_n = -1;
> > > > +	init_waitqueue_head(&ddat->waitq);
> > > > 		for_each_gt(gt, i915, i) {
> > > > 		ddat_gt = hwmon->ddat_gt + i;