[PATCH 2/4] drm/xe: declare wedged upon GuC load failure

Lucas De Marchi lucas.demarchi at intel.com
Tue Apr 16 19:13:39 UTC 2024


On Tue, Apr 16, 2024 at 03:05:46PM GMT, Rodrigo Vivi wrote:
>On Tue, Apr 09, 2024 at 06:15:05PM -0400, Rodrigo Vivi wrote:
>> Let's block the device upon any GuC load failure.
>> But let's continue with the probe so guc logs can be read
>> from the debugfs.
>>
>> v2: - s/wedged/busted
>>     - do not block probe or we lose guc_logs in debugfs (Matt)
>>
>> v3: - s/busted/wedged
>>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> ---
>>  drivers/gpu/drm/xe/xe_guc.c | 42 ++++++++++++++++---------------------
>>  1 file changed, 18 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>> index 240e7a4bbff1..f1c3e338301d 100644
>> --- a/drivers/gpu/drm/xe/xe_guc.c
>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>> @@ -451,7 +451,7 @@ static int guc_xfer_rsa(struct xe_guc *guc)
>>  	return 0;
>>  }
>>
>> -static int guc_wait_ucode(struct xe_guc *guc)
>> +static void guc_wait_ucode(struct xe_guc *guc)
>>  {
>>  	struct xe_gt *gt = guc_to_gt(guc);
>>  	u32 status;
>> @@ -479,30 +479,26 @@ static int guc_wait_ucode(struct xe_guc *guc)
>>  			     200000, &status, false);
>>
>>  	if (ret) {
>> -		xe_gt_info(gt, "GuC load failed: status = 0x%08X\n", status);
>> -		xe_gt_info(gt, "GuC status: Reset = %u, BootROM = %#X, UKernel = %#X, MIA = %#X, Auth = %#X\n",
>> -			   REG_FIELD_GET(GS_MIA_IN_RESET, status),
>> -			   REG_FIELD_GET(GS_BOOTROM_MASK, status),
>> -			   REG_FIELD_GET(GS_UKERNEL_MASK, status),
>> -			   REG_FIELD_GET(GS_MIA_MASK, status),
>> -			   REG_FIELD_GET(GS_AUTH_STATUS_MASK, status));
>> -
>> -		if ((status & GS_BOOTROM_MASK) == GS_BOOTROM_RSA_FAILED) {
>> -			xe_gt_info(gt, "GuC firmware signature verification failed\n");
>> -			ret = -ENOEXEC;
>> -		}
>> +		xe_gt_err(gt, "GuC load failed: status = 0x%08X\n", status);
>> +		xe_gt_err(gt, "GuC status: Reset = %u, BootROM = %#X, UKernel = %#X, MIA = %#X, Auth = %#X\n",
>> +			  REG_FIELD_GET(GS_MIA_IN_RESET, status),
>> +			  REG_FIELD_GET(GS_BOOTROM_MASK, status),
>> +			  REG_FIELD_GET(GS_UKERNEL_MASK, status),
>> +			  REG_FIELD_GET(GS_MIA_MASK, status),
>> +			  REG_FIELD_GET(GS_AUTH_STATUS_MASK, status));
>> +
>> +		if ((status & GS_BOOTROM_MASK) == GS_BOOTROM_RSA_FAILED)
>> +			xe_gt_err(gt, "GuC firmware signature verification failed\n");
>>
>>  		if (REG_FIELD_GET(GS_UKERNEL_MASK, status) ==
>> -		    XE_GUC_LOAD_STATUS_EXCEPTION) {
>> -			xe_gt_info(gt, "GuC firmware exception. EIP: %#x\n",
>> -				   xe_mmio_read32(gt, SOFT_SCRATCH(13)));
>> -			ret = -ENXIO;
>> -		}
>> +		    XE_GUC_LOAD_STATUS_EXCEPTION)
>> +			xe_gt_err(gt, "GuC firmware exception. EIP: %#x\n",
>> +				  xe_mmio_read32(gt, SOFT_SCRATCH(13)));
>> +
>> +		xe_device_declare_wedged(gt_to_xe(gt));
>>  	} else {
>>  		xe_gt_dbg(gt, "GuC successfully loaded\n");
>>  	}
>> -
>> -	return ret;
>>  }
>>
>>  static int __xe_guc_upload(struct xe_guc *guc)
>> @@ -532,16 +528,14 @@ static int __xe_guc_upload(struct xe_guc *guc)
>>  		goto out;
>>
>>  	/* Wait for authentication */
>> -	ret = guc_wait_ucode(guc);
>> -	if (ret)
>> -		goto out;
>> +	guc_wait_ucode(guc);
>>
>>  	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_RUNNING);
>>  	return 0;
>>
>>  out:
>>  	xe_uc_fw_change_status(&guc->fw, XE_UC_FIRMWARE_LOAD_FAIL);
>> -	return 0	/* FIXME: ret, don't want to stop load currently */;
>> +	return ret;
>
>Lucas, thanks for the review. Just to let you know that I'm removing
>this chunk from this patch. Himal had noticed and warned me that
>this would change the behavior of other cases that are not touched
>or covered by this patch. i.e. if the guc_load fails on guc_xfer_rsa
>or xe_uc_fw_upload, we were not aboarting the probe, but now we are.
>
>So, let's remove this change from this patch for now so we can go
>ahead with this and then on top we see if we do the wedged on top
>of the rest and make this function a void case.
>
>Agree?

yeah... I saw his reply and responded as well. We either remove this
hunk or we make the other cases also wedge the device instead of failing


Lucas De Marchi

>
>>  }
>>
>>  /**
>> --
>> 2.44.0
>>


More information about the Intel-xe mailing list