[PATCH 1/2] drm/xe/guc_pc: Do not stop probe or resume if GuC PC fails

Belgaumkar, Vinay vinay.belgaumkar at intel.com
Fri Feb 14 17:22:54 UTC 2025


On 2/14/2025 7:00 AM, Rodrigo Vivi wrote:
> On Thu, Feb 13, 2025 at 05:37:34PM -0800, Belgaumkar, Vinay wrote:
>> On 2/12/2025 10:15 AM, Rodrigo Vivi wrote:
>>> On Tue, Feb 11, 2025 at 05:19:14PM -0800, Belgaumkar, Vinay wrote:
>>>> On 2/11/2025 12:09 PM, Rodrigo Vivi wrote:
>>>>> In a rare thermally limited situation during resume, GuC can
>>>>> be slow and run into delays like this:
>>>>>
>>>>> xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
>>>>>                  [status = 0x8002F034, timeouts = 0]
>>>>> xe 0000:00:02.0: [drm] GT1: excessive init time: \
>>>>>                  [freq = 100MHz (req = 800MHz), before = 100MHz, \
>>>>>                  perf_limit_reasons = 0x1C001000]
>>>>> xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
>>>>> ------------[ cut here ]------------
>>>>> xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO
>>>>>
>>>>> If this happens, it can block the GPU from being used entirely.
>>>>> However, the GPU is still usable, even though the GT frequencies
>>>>> might be messed up.
>>>>>
>>>>> Let's report the error, but not block the flow.
>>>> Can we expect other random CI failures due to this? If the GT is not
>>>> getting the expected frequencies, certain tests that rely on them will
>>>> likely fail, causing a bunch of noise. Is that worse than the driver
>>>> load failing in this case?
>>> The issue whose log I pasted above is blocking resume on an LNL
>>> laptop. Everything goes blank, forcing the user to reboot the
>>> laptop.
>>>
>>> I'd rather deal with CI noise from bugs that we can work on
>>> than block users' resume.
>>>
>>> But well, we are still waiting an entire extra second there.
>>> That should be more than enough even under the thermal-limited
>>> condition. So, I'm not expecting more bugs than we already
>>> have.
>>>
>>> Also, our IGT test cases are prepared to deal with some EAGAIN
>>> returns, right? The probe and resume functions are not...
>>>
>>> But well, any suggestion here on a more robust approach?
>>> Or can we go with this one?
>> True, this will unblock resume. However, if this is a pcode bug, we will
>> allow boot despite a persistent failure to get anything above Pmin.
>> Maybe we can print the frequencies again here and explicitly warn about
>> the loss of dynamic frequencies and GuCRC (and all freq/c6-related
>> interfaces) from here on?
> Your gut feeling that something was not right paid off... The ret = 0 and
> the goto out were in the wrong if-block: even if the second wait succeeded,
> we would goto out without doing the proper freq and GuC RC initialization.
Yes, that was in my comments below :)
>
> So, what about something like this then:
>
> -               xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
> -               if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
> -                       xe_gt_err(gt, "GuC PC Start failed\n");
> -               /* Although GuC PC failed, do not block the usage of GPU */
> -               ret = 0;
> -               goto out;
> +               xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
> +                          xe_guc_pc_get_act_freq(pc), get_cur_freq(pc),
> +                          xe_gt_throttle_get_limit_reasons(gt));
> +               if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
> +                       xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
> +                       /* Although GuC PC failed, do not block the usage of GPU */
> +                       ret = 0;
> +                       goto out;
> +               }

Yup, I think this will work.
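
For reference, the resulting tail of xe_guc_pc_start() would then read
roughly like this (a sketch assembled from the hunks in this thread;
everything outside those hunks is assumed, not the final patch):

	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
			      SLPC_RESET_TIMEOUT_MS)) {
		xe_gt_warn(gt, "GuC PC excessive start time: [freq = %dMHz (req = %dMHz), perf_limit_reasons = 0x%08X]\n",
			   xe_guc_pc_get_act_freq(pc), get_cur_freq(pc),
			   xe_gt_throttle_get_limit_reasons(gt));
		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000)) {
			xe_gt_err(gt, "GuC PC Start failed: Dynamic GT frequency control and GT sleep states are now disabled.\n");
			/* Although GuC PC failed, do not block the usage of GPU */
			ret = 0;
			goto out;
		}
		/* Second wait succeeded: fall through to the normal freq/GuCRC setup */
	}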

Thanks,

Vinay.

>
>>> Thanks,
>>> Rodrigo.
>>>
>>>> Thanks,
>>>>
>>>> Vinay.
>>>>
>>>>> But, instead of just giving up and moving on, let's re-attempt the
>>>>> wait with a second, much longer timeout.
>>>>>
>>>>> v2: Keep the precision comment (Jonathan)
>>>>>        Use a define for the regular SLPC reset timeout.
>>>>>
>>>>> Cc: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>>>>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/xe/xe_guc_pc.c | 26 ++++++++++++++++++--------
>>>>>     1 file changed, 18 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c
>>>>> index 02409eedb914..3b04b62937eb 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
>>>>> @@ -50,6 +50,8 @@
>>>>>     #define LNL_MERT_FREQ_CAP	800
>>>>>     #define BMG_MERT_FREQ_CAP	2133
>>>>> +#define SLPC_RESET_TIMEOUT_MS 5 /* roughly 5ms, but no need for precision */
>>>>> +
>>>>>     /**
>>>>>      * DOC: GuC Power Conservation (PC)
>>>>>      *
>>>>> @@ -114,9 +116,10 @@ static struct iosys_map *pc_to_maps(struct xe_guc_pc *pc)
>>>>>     	 FIELD_PREP(HOST2GUC_PC_SLPC_REQUEST_MSG_1_EVENT_ARGC, count))
>>>>>     static int wait_for_pc_state(struct xe_guc_pc *pc,
>>>>> -			     enum slpc_global_state state)
>>>>> +			     enum slpc_global_state state,
>>>>> +			     int timeout_ms)
>>>>>     {
>>>>> -	int timeout_us = 5000; /* rought 5ms, but no need for precision */
>>>>> +	int timeout_us = 1000 * timeout_ms;
>>>>>     	int slept, wait = 10;
>>>>>     	xe_device_assert_mem_access(pc_to_xe(pc));
>>>>> @@ -165,7 +168,8 @@ static int pc_action_query_task_state(struct xe_guc_pc *pc)
>>>>>     	};
>>>>>     	int ret;
>>>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>>>     		return -EAGAIN;
>>>>>     	/* Blocking here to ensure the results are ready before reading them */
>>>>> @@ -188,7 +192,8 @@ static int pc_action_set_param(struct xe_guc_pc *pc, u8 id, u32 value)
>>>>>     	};
>>>>>     	int ret;
>>>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>>>     		return -EAGAIN;
>>>>>     	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
>>>>> @@ -209,7 +214,8 @@ static int pc_action_unset_param(struct xe_guc_pc *pc, u8 id)
>>>>>     	struct xe_guc_ct *ct = &pc_to_guc(pc)->ct;
>>>>>     	int ret;
>>>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING))
>>>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>>>> +			      SLPC_RESET_TIMEOUT_MS))
>>>>>     		return -EAGAIN;
>>>>>     	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
>>>>> @@ -1033,9 +1039,13 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
>>>>>     	if (ret)
>>>>>     		goto out;
>>>>> -	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING)) {
>>>>> -		xe_gt_err(gt, "GuC PC Start failed\n");
>>>>> -		ret = -EIO;
>>>>> +	if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING,
>>>>> +			      SLPC_RESET_TIMEOUT_MS)) {
>>>>> +		xe_gt_warn(gt, "GuC PC Start taking longer than expected\n");
>>>>> +		if (wait_for_pc_state(pc, SLPC_GLOBAL_STATE_RUNNING, 1000))
>>>>> +			xe_gt_err(gt, "GuC PC Start failed\n");
>>>>> +		/* Although GuC PC failed, do not block the usage of GPU */
>>>>> +		ret = 0;
>> Looks like we are skipping SLPC init even if we succeed in getting the
>> right pc_state on the retry? We should continue with the normal init in
>> that case (we need an else).
>>
>> Thanks,
>>
>> Vinay.
>>
>>>>>     		goto out;
>>>>>     	}
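
For anyone reading along: the hunk above only shows the lines of
wait_for_pc_state() that change, so here is roughly what the whole polling
loop consuming the new timeout_ms parameter looks like (a sketch
reconstructed from the context lines above; the exact backoff details in
xe_guc_pc.c may differ):

static int wait_for_pc_state(struct xe_guc_pc *pc,
			     enum slpc_global_state state,
			     int timeout_ms)
{
	int timeout_us = 1000 * timeout_ms;
	int slept, wait = 10;

	xe_device_assert_mem_access(pc_to_xe(pc));

	for (slept = 0; slept < timeout_us;) {
		/* SLPC publishes its global state through the shared data blob */
		if (slpc_shared_data_read(pc, header.global_state) == state)
			return 0;

		usleep_range(wait, wait * 2);
		slept += wait;

		/* Back off exponentially during the first half of the timeout */
		if (slept < timeout_us / 2)
			wait <<= 1;
	}

	return -ETIMEDOUT;
}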

