[PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails

Fri Feb 14 15:00:42 UTC 2025

On Thu, Feb 13, 2025 at 05:37:34PM -0800, Belgaumkar, Vinay wrote:
> 
> On 2/12/2025 10:15 AM, Rodrigo Vivi wrote:
> > On Tue, Feb 11, 2025 at 05:19:14PM -0800, Belgaumkar, Vinay wrote:
> > > On 2/11/2025 12:09 PM, Rodrigo Vivi wrote:
> > > > In a rare situation of thermal limit during resume, GuC can
> > > > be slow and run into delays like this:
> > > > 
> > > > xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
> > > >      		 [status = 0x8002F034, timeouts = 0]
> > > > xe 0000:00:02.0: [drm] GT1: excessive init time: \
> > > >      		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
> > > >      		 perf_limit_reasons = 0x1C001000]
> > > > xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
> > > > ------------[ cut here ]------------
> > > > xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
> > > > 
> > > > If this happens, it can entirely block use of the GPU.
> > > > However, the GPU can still be used, although the GT frequencies might
> > > > be messed up.
> > > > 
> > > > Let's report the error, but not block the flow.
> > > Can we expect other random CI failures due to this? If GT is not getting
> > > expected frequencies, certain tests which rely on this will likely fail,
> > > causing a bunch of noise. Is that worse than driver load failing in this
> > > case?
> > The issue whose log I pasted above is blocking the resume of an LNL
> > laptop. Everything goes blank, forcing the user to reboot the
> > laptop.
> > 
> > I would rather deal with CI noise from bugs that we can work on
> > than block users' resume.
> > 
> > But well, we are still waiting one entire extra second there.
> > That should be more than enough even with the thermal limited
> > condition there. So, I'm not expecting more bugs than we already
> > have.
> > 
> > Also, our IGT test cases are prepared to deal with some EAGAIN
> > returns, right? The probe and resume functions are not...
> > 
> > But well, any suggestion here on a more robust approach?
> > Or can we go with this one?
> 
> True, this will unblock resume. However, if this is a pcode bug, we will
> allow boot in spite of a persistent failure to get anything above Pmin.
> Maybe we can print the frequencies again here and explicitly warn about the
> loss of dynamic frequencies and GuCRC (and all freq/c6 related interfaces)
> from here on?

Your gut feeling that something was not right paid off... The ret = 0 and
the goto out were in the wrong if. Even if the second wait succeeded, we
would still goto out without doing the proper freq and GuCRC initialization.

So, what about something like this then:

-               xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
-               if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
-                       xe_gt_err(gt, "GuC PC Start failed\n");
-               /* Although GuC PC failed, do not block the usage of GPU */
-               ret = 0;
-               goto out;
+               xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
+                          xe_guc_pc_get_act_freq(pc), get_cur_freq(pc),
+                          xe_gt_throttle_get_limit_reasons(gt));
+               if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
+                       xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
+                       /* Although GuC PC failed, do not block the usage of GPU */
+                       ret = 0;
+                       goto out;
+               }
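
To make the intended control flow easier to check, here is a minimal
standalone sketch of it in userspace C. It is not the driver code: struct
fake_pc, fake_wait_for_running() and fake_pc_start() are hypothetical
stand-ins for the xe structures and for wait_for_pc_state(), and the wait is
only simulated. The 5ms and 1000ms timeouts are the ones from the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the GuC PC state, for illustration only. */
struct fake_pc {
	int ready_after_ms;	/* simulated time until SLPC is RUNNING; < 0 means never */
	int waited_ms;		/* total simulated time spent waiting so far */
	bool slpc_initialized;	/* set once the freq/GuCRC init path has run */
};

/*
 * Stand-in for wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, timeout_ms).
 * Returns 0 if the state is reached within timeout_ms, nonzero otherwise.
 */
static int fake_wait_for_running(struct fake_pc *pc, int timeout_ms)
{
	int remaining = pc->ready_after_ms - pc->waited_ms;

	if (pc->ready_after_ms >= 0 && remaining <= timeout_ms) {
		if (remaining > 0)
			pc->waited_ms += remaining;
		return 0;
	}

	pc->waited_ms += timeout_ms;
	return -1;
}

/* Sketch of the start path with the ret = 0 / goto out inside the inner if. */
static int fake_pc_start(struct fake_pc *pc)
{
	if (fake_wait_for_running(pc, 5)) {
		fprintf(stderr, "warn: GuC PC excessive start time\n");
		if (fake_wait_for_running(pc, 1000)) {
			fprintf(stderr, "err: GuC PC Start failed\n");
			/* Do not block the usage of GPU, but skip the init below */
			return 0;
		}
	}

	/* Reached when either the first wait or the retry succeeded */
	pc->slpc_initialized = true;
	return 0;
}
```

With this shape, a firmware that becomes ready during the long retry still
goes through the normal init, and only a failure of both waits skips it while
still returning 0.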

> 
> > 
> > Thanks,
> > Rodrigo.
> > 
> > > Thanks,
> > > 
> > > Vinay.
> > > 
> > > > But, instead of just giving up and moving on, let's re-attempt a wait
> > > > with a very long second timeout.
> > > > 
> > > > v2: Keep the precision comment (Jonathan)
> > > >       Use a define for the regular SLPC reset timeout.
> > > > 
> > > > Cc: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
> > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > ---
> > > >    drivers/gpu/drm/xe/xe_guc_pc.c | 26 ++++++++++++++++++--------
> > > >    1 file changed, 18 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > index 02409eedb914..3b04b62937eb 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> > > > @@ -50,6 +50,8 @@
> > > >    #define LNL_MERT_FREQ_CAP	800
> > > >    #define BMG_MERT_FREQ_CAP	2133
> > > > +#define SLPC_RESET_TIMEOUT_MS 5 /* rought 5ms, but no need for precision */
> > > > +
> > > >    /**
> > > >     * DOC: GuC Power Conservation (PC)
> > > >     *
> > > > @@ -114,9 +116,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
> > > >    	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
> > > >    static int wait_for_pc_state(struct xe_guc_pc *pc,
> > > > -			     enum slpc_global_state state)
> > > > +			     enum slpc_global_state state,
> > > > +			     int timeout_ms)
> > > >    {
> > > > -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
> > > > +	int timeout_us = 1000 * timeout_ms;
> > > >    	int slept, wait = 10;
> > > >    	xe_device_assert_mem_access(pc_to_xe(pc));
> > > > @@ -165,7 +168,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
> > > >    	};
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	/* Blocking here to ensure the results are ready before reading them */
> > > > @@ -188,7 +192,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
> > > >    	};
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > @@ -209,7 +214,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
> > > >    	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
> > > >    	int ret;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS))
> > > >    		return -EAGAIN;
> > > >    	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> > > > @@ -1033,9 +1039,13 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
> > > >    	if (ret)
> > > >    		goto out;
> > > > -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
> > > > -		xe_gt_err(gt, "GuC PC Start failed\n");
> > > > -		ret = -EIO;
> > > > +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
> > > > +			      SLPC_RESET_TIMEOUT_MS)) {
> > > > +		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
> > > > +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
> > > > +			xe_gt_err(gt, "GuC PC Start failed\n");
> > > > +		/* Although GuC PC failed, do not block the usage of GPU */
> > > > +		ret = 0;
> 
> Looks like we are skipping SLPC init even if we succeed in getting the right
> pc_state on the retry? We should continue with normal init in that case (need
> an else).
> 
> Thanks,
> 
> Vinay.
> 
> > > >    		goto out;
> > > >    	}