[PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails

Thu Mar 6 23:36:35 UTC 2025

On Fri, Feb 28, 2025 at 03:32:54PM -0500, Rodrigo Vivi wrote:
> On Fri, Feb 28, 2025 at 12:13:24PM -0800, John Harrison wrote:
> > On 2/28/2025 11:45, Rodrigo Vivi wrote:
> > > On Fri, Feb 28, 2025 at 11:22:02AM -0800, John Harrison wrote:
> > > > On 2/14/2025 09:25, Rodrigo Vivi wrote:
> > > > > In a rare situation of thermal limit during resume, GuC can
> > > > > be slow and run into delays like this:
> > > > > 
> > > > > xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
> > > > >      		 [status = 0x8002F034, timeouts = 0]
> > > > > xe 0000:00:02.0: [drm] GT1: excessive init time: \
> > > > >      		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
> > > > >      		 perf_limit_reasons = 0x1C001000]
> > > > > xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
> > > > > ------------[ cut here ]------------
> > > > > xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
> > > > > 
> > > > > If this happens, this can block entirely the GPU to be used.
> > > > > However, GPU can still be used, although the GT frequencies might be
> > > > > messed up.
> > > > > 
> > > > > Let's report the error, but not block the flow.
> > > > > But, instead of just giving up and moving on, let's re-attempt a wait
> > > > > with a very long second timeout.
> > > > > 
> > > > > v2: Keep the precision comment (Jonathan)
> > > > >       Use a define for the regular SLPC reset timeout.
> > > > > v3: Improve messages (Vinay)
> > > > >       Only skip initialization if the second full-second wait failed.
> > > > > 
> > > > > Cc: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
> > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com> #v2
> > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_guc_pc.c | 46 ++++++++++++++++++++++++----------
> > > > >    1 file changed, 33 insertions(+), 13 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > > index 02409eedb914..74cc13012532 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > > @@ -20,6 +20,7 @@
> > > > >    #include "xe_gt.h"
> > > > >    #include "xe_gt_idle.h"
> > > > >    #include "xe_gt_printk.h"
> > > > > +#include "xe_gt_throttle.h"
> > > > >    #include "xe_gt_types.h"
> > > > >    #include "xe_guc.h"
> > > > >    #include "xe_guc_ct.h"
> > > > > @@ -50,6 +51,8 @@
> > > > >    #define LNL_MERT_FREQ_CAP	800
> > > > >    #define BMG_MERT_FREQ_CAP	2133
> > > > > +#define SLPC_RESET_TIMEOUT_MS 5 /* rought 5ms, but no need for precision */
> > > > > +
> > > > >    /**
> > > > >     * DOC: GuC Power Conservation (PC)
> > > > >     *
> > > > > @@ -114,9 +117,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
> > > > >    	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
> > > > >    static int wait_for_pc_state(struct xe_guc_pc *pc,
> > > > > -			     enum slpc_global_state state)
> > > > > +			     enum slpc_global_state state,
> > > > > +			     int timeout_ms)
> > > > >    {
> > > > > -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
> > > > > +	int timeout_us = 1000 * timeout_ms;
> > > > >    	int slept, wait = 10;
> > > > >    	xe_device_assert_mem_access(pc_to_xe(pc));
> > > > > @@ -165,7 +169,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
> > > > >    	};
> > > > >    	int ret;
> > > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > > >    		return -EAGAIN;
> > > > >    	/* Blocking here to ensure the results are ready before reading them */
> > > > > @@ -188,7 +193,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
> > > > >    	};
> > > > >    	int ret;
> > > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > > >    		return -EAGAIN;
> > > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > > @@ -209,7 +215,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
> > > > >    	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
> > > > >    	int ret;
> > > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > > >    		return -EAGAIN;
> > > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > > @@ -443,6 +450,15 @@ u32 xe_guc_pc_get_act_freq(struct xe_guc_pc *pc)
> > > > >    	return freq;
> > > > >    }
> > > > > +static u32 get_cur_freq(struct xe_gt *gt)
> > > > > +{
> > > > > +	u32 freq;
> > > > > +
> > > > > +	freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
> > > > > +	freq = REG_FIELD_GET(REQ_RATIO_MASK, freq);
> > > > > +	return decode_freq(freq);
> > > > > +}
> > > > > +
> > > > >    /**
> > > > >     * xe_guc_pc_get_cur_freq - Get Current requested frequency
> > > > >     * @pc: The GuC PC
> > > > > @@ -466,10 +482,7 @@ int xe_guc_pc_get_cur_freq(struct xe_guc_pc *pc, u32 *freq)
> > > > >    		return -ETIMEDOUT;
> > > > >    	}
> > > > > -	*freq = xe_mmio_read32(&gt->mmio, RPNSWREQ);
> > > > > -
> > > > > -	*freq = REG_FIELD_GET(REQ_RATIO_MASK, *freq);
> > > > > -	*freq = decode_freq(*freq);
> > > > > +	*freq = get_cur_freq(gt);
> > > > >    	xe_force_wake_put(gt_to_fw(gt), fw_ref);
> > > > >    	return 0;
> > > > > @@ -1033,10 +1046,17 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> > > > >    	if (ret)
> > > > >    		goto out;
> > > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
> > > > > -		xe_gt_err(gt, "GuC PC Start failed\n");
> > > > > -		ret = -EIO;
> > > > > -		goto out;
> > > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > > +			      SLPC_RESET_TIMEOUT_MS)) {
> > > > > +		xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
> > > > > +			   xe_guc_pc_get_act_freq(pc), get_cur_freq(gt),
> > > > > +			   xe_gt_throttle_get_limit_reasons(gt));
> > > > > +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
> > > > Shouldn't this be a define as well - SLPC_RESET_EXTENDED_TIMEOUT_MS or
> > > > something?
> > > good idea! will do.

done

> > > 
> > > > More importantly, Is 1ms enough of an extra wait?
> > > The new timeout argument is in ms, so it is 1 second.
> > Doh! Yes, I saw that but then completely spaced it out again!
> > 
> > > 
> > > > If the GT freq is 100MHz
> > > > instead of 2GHz or some such then the expected max of 5ms could now be more
> > > > like 100ms if not even longer (the slow down does not seem linear). As an
> > > > example, the GuC load itself should be <10ms but with clamped frequencies we
> > > > generally see over 500ms, sometimes over 1s.
> > > hmm... over 1s possible? so, perhaps 1250 to be on the safe side?
> > > other suggestions?
> > I think a second should be good but I don't what is involved in the SLPC
> > start up? The long delay loading the GuC is due to doing decryption which is
> > a hugely CPU intensive task and the GuC is not a huge CPU! If SLPC is more
> > about waiting for hardware to respond then maybe the slow down won't be as
> > severe? Plus the GuC load is inherently slower in the first place - our
> > original timeout was 200ms with expected values in the 5-15ms range. If SLPC
> > is starting from a 5ms timeout then presumably the expected time is actually
> > more like 1ms or less?
> 
> Yeap, I randomly put a big wait because I wasn't sure why/what.
> 
> > 
> > You could try running with the frequency manually set to 300MHz and see how
> > long it takes. I think that is the lowest we can explicitly request from the
> > KMD?
> 
> Great idea! Although it can change a lot by platform and SKUs, but we could
> have at least a rough idea instead of a blind big guess.

Well, it looks like when everything is normal, we shouldn't take
longer than 300 *us* even with lower frequencies. So I could not quite
reproduce a case like that.

But well, let's try with this big hammer second wait here and warn on how
long it takes, so we can tune it down later.

> 
> > 
> > > 
> > > > > +			xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
> > > > > +			/* Although GuC PC failed, do not block the usage of GPU */
> > > > > +			ret = 0;
> > > > I thought the new policy was that any subsystem failure should now be
> > > > considered fatal and abort driver load? I recall a PXP start failure was
> > > > recently upgrading to being fatal even though PXP is almost never used by
> > > > any actual users. SLPC seems much more vital to the system than PXP!
> > > Hmm... good point! I have to get back to the board then and have
> > > this logic only for the resume?!
> > > 
> > > If this happens during the probe yeap, let's block because subsystems
> > > are buggy. But the case I'm hunting here is a resume from S2idle that
> > > is entirely hanging the platform when this happens under thermal constrains.
> > Hmm. What platform is the problem showing up on? There are a couple of other
> > bug reports about systems coming up in an odd state after suspend - e.g. GuC
> > image not loading due to memory corruption. I wonder if it is not actually a
> > thermal problem but just something confused due to uninitialised state
> > somewhere? Plus, how can you be in thermal meltdown on a resume? If the
> > power was lost then the device should be cold!
> 
> Indeed. It was a LNL case in a very specific kernel version. Issue is not
> reproducible anymore. But with that bug I realized we were actually entirely
> hanging the platform on resume and this is not a good approach, even though
> the original issue was not ours.

Okay, In the new version we are still going to block the probe. If it is
less than the extended timeout it prints out how long it took, otherwise
we continue with the -ETIMEDOUT return and block the probe/resume.

> 
> > 
> > > 
> > > Thoughts? I'm open to suggestions here.
> > My main thought is that if the frequency is clamped (by the hardware itself)
> > at absolute minimum then the system is not going to be very usable anyway.
> > So is continuing to run by using huge timeouts actually beneficial? But not
> > sure what else we can do at this point? Maybe try an FLR? But yeah, it is
> > probably good to try harder to keep going on a resume than on first driver
> > load.
> 
> Well, with the resume happening, the FLR could be a bad hammer. But well,
> worth considering indeed. I will do some more experiments around and see
> our options. But the hang as currently is is the worst scenario.

I couldn't get anything like that to work reliably. If we are in that
thermal bad resume condition, the flr can cause other side effects.
Let's just wait a bit longer and log (warn) the case and move on.

Sending the new version soon...

Thank you!

> 
> Thanks a lot again,
> Rodrigo.
> 
> > 
> > John.
> > 
> > > 
> > > Thanks a lot for raising these so far,
> > > Rodrigo.
> > > 
> > > > John.
> > > > 
> > > > > +			goto out;
> > > > > +		}
> > > > >    	}
> > > > >    	ret = pc_init_freqs(pc);
> >