[Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

Wed Oct 4 21:36:35 UTC 2023

On Wed, Oct 04, 2023 at 10:58:32PM +0200, Andi Shyti wrote:
> Hi Matt,
> 
> > > > > > > The MCR steering semaphore is a shared lock entry between i915
> > > > > > > and various firmware components.
> > > > > > > 
> > > > > > > Getting the lock might sinchronize on some shared resources.
> > > > > > > Sometimes though, it might happen that the firmware forgets to
> > > > > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > > > > what was supposed to do, ignoring the problem.
> > > > > > > 
> > > > > > > Do not consider this failure as an error, but just print a debug
> > > > > > > message stating that the MCR locking has been skipped.
> > > > > > > 
> > > > > > > On the driver side we still have spinlocks that make sure that
> > > > > > > the access to the resources is serialized.
> > > > > > > 
> > > > > > > Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
> > > > > > > Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > > > > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > > > > Cc: Nirmoy Das <nirmoy.das at intel.com>
> > > > > > > ---
> > > > > > >    drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++----
> > > > > > >    1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
> > > > > > >    	 * would indicate some hardware/firmware is misbehaving and not
> > > > > > >    	 * releasing it properly.
> > > > > > >    	 */
> > > > > > > -	if (err == -ETIMEDOUT) {
> > > > > > > -		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
> > > > > > > -		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
> > > > > > > -	}
> > > > > > > +	if (err == -ETIMEDOUT)
> > > > > > > +		gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > > > > >    }
> > > > > > >    /**
> > > > > > Are we sure this does not warrant a level higher than dbg, such as
> > > > > > notice/warn?
> > > > > We might make it a warn, but this doesn't change much the economy
> > > > > of the driver as we will keep doing what we were supposed to do.
> > > > > 
> > > > > > Because how can we be sure the two entities will not stomp on
> > > > > > each other toes if we failed to obtain lock?
> > > > > So far, in all the research I've done, no one looks like using
> > > > > MCR lock, but yet someone is stuck in it.
> > > > 
> > > > If someone has the lock then that someone thinks they are using it. Just
> > > > because you can't see what someone piece of IFWI is doing doesn't mean it
> > > > isn't doing it. And if it is a genuinely missing unlock then it needs to be
> > > > tracked down and fixed with an IFWI update otherwise the system is going to
> > > > be unstable from that point on.
> > > 
> > > But I'm not changing here the behavior of the driver. The driver
> > > will keep doing what was doing before.
> > > 
> > > Because this most probably won't be noticed by the user, then I
> > > don't see why it should shout out loud that the system is
> > > unusable when most probably it is.
> > 
> > That's like saying that any random race condition isn't likely to be
> > noticed by the user so it's not a big deal if we're missing a few
> > mutexes or spinlocks somewhere...even though there's likely to be no
> > user-visible impact to any race condition 99% of the time, it's the 1%
> > that winds up being absolutely catastrophic.
> 
> Not really... normally if you hit a spinlock/mutex race
> condition, you end up in a deadlock and stall the system. In this
> case, I agree that the lock should be sorted out by the hardware,
> but in the meantime i915 is *already* ignoring it.
> 
> I'm not making any behavioral change with this patch.
> 
> What I'm trying to say is that if the system doesn't crash, then
> let it go... don't crash it on purpose just because there is a
> locking situation and we think it has a catastrophic effect, but
> in reality is not producing anything (like practically in our
> case, you can check the CI results[*]).
> 
> [*] https://patchwork.freedesktop.org/patch/560733/?series=124599&rev=1

But hiding the error message is the opposite of the direction we should
be moving.  If anything we need to be escalating this harder, for
example by wedging the GT to truly prevent if from being used further.
We obviously don't want a BUG() or something that would crash the whole
system and potentially cause non-graphics data loss, but we really
shouldn't let regular operation keep going.  The goal of ignoring the
the semaphore and moving forward is that we keep the system alive for a
developer to gather more debug information.

If this is ever seen in the wild by a regular user, we definitely want
to be getting bug reports about that ASAP because it means they have a
system that they can't safely use.

> 
> > If we're not obtaining the MCR lock as expected and are simply moving
> > forward to force our own steering (possibly at the same time firmware is
> > programming steering to a different value), you probably won't actually
> > see a problem either because the operations won't wind up interleaving
> > in a problematic order, or because the driver and the firmware both
> > happen to be trying to steer to the same instance (e.g., instance #0 is
> > a quite common target).  But even if they're hard to hit, the
> > possibility for a major problem is still there and basically we need to
> > consider the whole platform to be in an unknown, unstable state once
> > we've detected one of these issues.
> > 
> > Based on some earlier experiments, it sounds like the problem at the
> > moment is that we've just been too hasty with deciding that the lock is
> > "stuck."  There's no formal guidance on what an appropriate timeout is,
> > but Jonathan's patch to increase the timeout by 10x (from 100ms to 1s)
> > should hopefully take care of those cases and prevent us from causing
> > any unprotected races.
> 
> Unfortunately it doesn't. Here[**] you can see how the situation
> doesn't change in mtlp-8.
> 
> [**] https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124576v2/index.html?

That's unfortunate. :-(

Probably the next step is to re-run the ftrace experiements and confirm
once again that it's still definitely an external firmware agent that
we're stuck waiting on and not some other i915 thread.  "MCR lock
timeout" is a general type of problem that could potentially be caused
by different things in different situations; i915 wasn't the culprit for
the suspend/resume case, but we should confirm that it also isn't the
culprit for whichever specific CI tests are failing here.  I know we've
already audited our critical sections to make sure that they're all
short and aren't holding the lock for a long time, but I'm imagining a
case where an IGT stress test spawns thousands of threads, each one
doing something that will wind up triggering MCR operations at the same
time, and one of those threads just winds up waiting too long for the
crowd to thin out and give it a chance with the lock.  I think our CI
system runs with KASAN enabled now, which often brings CPU performance
to a crawl, so if this is some kind of thundering herd situation that's
being exacerbated by KASAN, we should be able to figure that out with
tracepoints.

Matt

> 
> > If we have any other problems where the lock is
> > truly stuck (as was seen after an S3 resume), that's a critical error
> > that needs to be escalated to whichever component is the culprit --- any
> > such system simply cannot be used safely.  Even if the KMD didn't notice
> > such stuck semaphores itself, one misbehaving firmware agent could still
> > block other firmware agents and cause major problems for the health of
> > the system.
> 
> Agree... we are working with hardware guys to get down to the
> root cause. We knew already when we merged this patch[***] that
> this wouldn't fix the issue as the issue still lies underneath.
> 
> [***] 37280ef5c1c4 ("drm/i915: Clean steer semaphore on resume")
> 
> > > BTW, at my understanding this is not an IFWI problem. We checked
> > > with different version of IFWI and the issue is the same.
> > 
> > It may not be an IFWI _regression_, but unless we're contending with
> > ourselves (via different CPU threads, which I think we've already ruled
> > out through some ftrace experiments), then it does mean that something
> > in the IFWI is causing the lock to be held longer than expected.  That
> > may have always been the behavior since day 1 and we just didn't notice
> > until we got a relatively clean CI setup to start seeing these errors.
> > 
> > > 
> > > Besides we received reports also from systems that are not using
> > > IFWI at all.
> > 
> > What does this mean?  IFWI is just the generic term we use for the blob
> > we flash onto the system that bundles a bunch of different types of
> > firmware.  E.g., stuff like pcode firmware, csme, EFI GOP, etc.  If a
> > 3rd party is putting together a MTL device, they probably refer to their
> > firmware bundles with some term other than "IFWI" and may swap out some
> > of the specific components, but at the end of the day the system still
> > has the important low-level firmware running for things like pcode.
> > 
> > > 
> > > > 
> > > > > 
> > > > > > (How can we be sure about
> > > > > > "forgot to unlock" vs "in prolonged active use"?
> > > > > There is a patch from Jonathan that is testing a different
> > > > > timeout.
> > > > > 
> > > > > > Or if we can be sure, can
> > > > > > we force unlock and take the lock for the driver explicitly?)
> > > > > I sent a patch with this solution and Matt turned it down.
> > > > 
> > > > Presumably because both forcing a lock and ignoring a failed lock are Bad
> > > > Things to be doing!
> > > > Just because some entity out of our control isn't playing friendly doesn't
> > > > mean we can ignore the problem. The lock is there for a reason. If someone
> > > > else owns the lock then any steered access by i915 is potentially going to
> > > > be routed to the wrong register as the other entity re-directs the steering
> > > > behind out back. That is why there is an error message being printed.
> > > > Because things are quite possibly going to fail in some unknown manner.
> > > 
> > > Yes, agree. This has been already discussed here[*] where I sent
> > > such RFC for the sole purpose of receiving some opinions and
> > > check how CI would behave.
> > > 
> > > BTW, we are already doing this during the GT resume[**]... it's a
> > > bit of a different context, but it still forces the release of
> > > the lock.
> > 
> > That resume-time patch is only deemed safe because we got explicit
> > confirmation from the architects that it's not possible for any of the
> > external agents to truly be using the MCR lock at that point during
> > driver load and S3 resume.  It's still a serious bug in some firmware
> > component, but in that specific case it's one that we can fix up
> > ourselves without risk.  Any locking failures seen later on during
> > regular system use cannot be solves safely.
> 
> Yes! Agree! :-)
> 
> Thanks, Matt!
> Andi

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation