[PATCH] drm/xe/guc: Escalate GuC load failure immediately

Wed Feb 12 19:11:11 UTC 2025


> -----Original Message-----
> From: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>
> Sent: Tuesday, February 11, 2025 10:38 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost at intel.com>; Harrison, John C
> <john.c.harrison at intel.com>
> Subject: Re: [PATCH] drm/xe/guc: Escalate GuC load failure immediately
> 
> 
> 
> On 2/11/2025 1:53 AM, Upadhyay, Tejas wrote:
> >
> >> -----Original Message-----
> >> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> >> Daniele Ceraolo Spurio
> >> Sent: Tuesday, February 11, 2025 6:17 AM
> >> To: intel-xe at lists.freedesktop.org
> >> Cc: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>; Brost,
> >> Matthew <matthew.brost at intel.com>; Harrison, John C
> >> <john.c.harrison at intel.com>
> >> Subject: [PATCH] drm/xe/guc: Escalate GuC load failure immediately
> >>
> >> When the Xe was first introduced, we intentionally avoided escalating
> >> GuC load failures, to not abort mid-probe. Xe is now mature enough
> >> and we gracefully handle probe failures, so we can start escalating
> immediately.
> >>
> >> Note that even without this patch the probe is still aborted because
> >> the attempt to enable CTs after GuC load will fail and that failure
> >> is already escalated.
> >>
> >> Signed-off-by: Daniele Ceraolo Spurio
> >> <daniele.ceraolospurio at intel.com>
> >> Cc: Matthew Brost <matthew.brost at intel.com>
> >> Cc: John Harrison <John.C.Harrison at Intel.com>
> >> ---
> >>   drivers/gpu/drm/xe/xe_guc.c | 10 +++++++---
> >>   1 file changed, 7 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_guc.c
> >> b/drivers/gpu/drm/xe/xe_guc.c index
> >> 1619c0a52db9..13c3084c42c2 100644
> >> --- a/drivers/gpu/drm/xe/xe_guc.c
> >> +++ b/drivers/gpu/drm/xe/xe_guc.c
> >> @@ -938,7 +938,7 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc
> >> *guc_pc)  #endif
> >>   #define GUC_LOAD_TIME_WARN_MS      200
> >>
> >> -static void guc_wait_ucode(struct xe_guc *guc)
> >> +static int guc_wait_ucode(struct xe_guc *guc)
> >>   {
> >>   	struct xe_gt *gt = guc_to_gt(guc);
> >>   	struct xe_mmio *mmio = &gt->mmio;
> >> @@ -1045,6 +1045,8 @@ static void guc_wait_ucode(struct xe_guc *guc)
> >>   			  delta_ms, xe_guc_pc_get_act_freq(guc_pc),
> >> guc_pc_get_cur_freq(guc_pc),
> >>   			  before_freq, status, count);
> >>   	}
> >> +
> >> +	return load_done ? 0 : -EIO;
> > Do you see possibility of load_done to be -1 and you will return success
> here?
> 
> You're right, I'll change this to load_done == 1.

Ok with that fixed, LGTM, 
Reviewed-by: Tejas Upadhyay <tejas.upadhyay at intel.com>

Tejas
> 
> Daniele
> 
> >
> > Tejas
> >>   }
> >>
> >>   static int __xe_guc_upload(struct xe_guc *guc) @@ -1077,14 +1079,16
> >> @@ static int __xe_guc_upload(struct xe_guc *guc)
> >>   		goto out;
> >>
> >>   	/* Wait for authentication */
> >> -	guc_wait_ucode(guc);
> >> +	ret = guc_wait_ucode(guc);
> >> +	if (ret)
> >> +		goto out;
> >>
> >>   	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_RUNNING);
> >>   	return 0;
> >>
> >>   out:
> >>   	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_LOAD_FAIL);
> >> -	return 0	/* FIXME: ret, don't want to stop load currently */;
> >> +	return ret;
> >>   }
> >>
> >>   static int vf_guc_min_load_for_hwconfig(struct xe_guc *guc)
> >> --
> >> 2.43.0