[Intel-gfx] [PATCH] drm/i915/guc: Don't error on reset of banned context

Matthew Brost matthew.brost at intel.com
Tue Jan 11 18:45:41 UTC 2022


On Thu, Jan 06, 2022 at 04:31:43PM -0800, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
> 
> There is a race (already documented in the code) whereby a context can
> be (re-)queued for submission at the same time as it is being banned
> due to a hang and reset. That leads to a hang/reset report from GuC
> for a context which i915 thinks is already banned.
> 

I think there are 2 issues here.

1. Banning of context (e.g. user closes a non-persistent context)
results in an context reset. In this case we will receive a G2H
indicating a context reset and we want to convert the context reset to a
nop.

2. A GT reset races with a context reset result in the context getting
banned before the G2H is processed. Again we want to convert the context
reset to a nop. This race should be sealed when we can flush the G2H
handler in the reset path. Flushing G2H handler depends on the error
capture not allocating memory in non-sleeping contexts. Thomas H had a
patch for this.

In both cases we shouldn't print an error.

> While the race is indented to be fixed in a future GuC update, there
> is no actual harm beyond the wasted execution time of that new hang
> detection period. The context has already been banned for bad
> behaviour so a fresh hang is hardly surprising and certainly isn't
> going to be losing any work that wouldn't already have been lost if
> there was no race.
>

See above, I think you are confusing the issues here. This won't be
fixed by an updated GuC firmware.

> So don't treat this situation as an error. The error message is seen
> by the CI system as something fatal and causes test failures. Instead,
> just print an informational so the user at least knows a context reset
> occurred (given that the error capture is being skipped).
> 
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9989d121127d..e8a32a7e7daf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3978,6 +3978,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>  		   !context_blocked(ce))) {
>  		capture_error_state(guc, ce);
>  		guc_context_replay(ce);
> +	} else if (intel_context_is_banned(ce)) {
> +		drm_info(&guc_to_gt(guc)->i915->drm,
> +			 "Reset notificaion for banned context 0x%04X on %s",
> +			 ce->guc_id.id, ce->engine->name);

The context being blocking isn't an error either. I think real fix is
changing the below drm_err to drm_info and call it a day.

Matt

>  	} else {
>  		drm_err(&guc_to_gt(guc)->i915->drm,
>  			"Invalid GuC engine reset notificaion for 0x%04X on %s: banned = %d, blocked = %d",
> -- 
> 2.25.1
> 


More information about the Intel-gfx mailing list