[Intel-gfx] [PATCH] drm/i915/guc: Don't error on reset of banned context
Matthew Brost
matthew.brost at intel.com
Tue Jan 11 18:45:41 UTC 2022
On Thu, Jan 06, 2022 at 04:31:43PM -0800, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> There is a race (already documented in the code) whereby a context can
> be (re-)queued for submission at the same time as it is being banned
> due to a hang and reset. That leads to a hang/reset report from GuC
> for a context which i915 thinks is already banned.
>
I think there are 2 issues here.
1. Banning of context (e.g. user closes a non-persistent context)
results in an context reset. In this case we will receive a G2H
indicating a context reset and we want to convert the context reset to a
nop.
2. A GT reset races with a context reset result in the context getting
banned before the G2H is processed. Again we want to convert the context
reset to a nop. This race should be sealed when we can flush the G2H
handler in the reset path. Flushing G2H handler depends on the error
capture not allocating memory in non-sleeping contexts. Thomas H had a
patch for this.
In both cases we shouldn't print an error.
> While the race is indented to be fixed in a future GuC update, there
> is no actual harm beyond the wasted execution time of that new hang
> detection period. The context has already been banned for bad
> behaviour so a fresh hang is hardly surprising and certainly isn't
> going to be losing any work that wouldn't already have been lost if
> there was no race.
>
See above, I think you are confusing the issues here. This won't be
fixed by an updated GuC firmware.
> So don't treat this situation as an error. The error message is seen
> by the CI system as something fatal and causes test failures. Instead,
> just print an informational so the user at least knows a context reset
> occurred (given that the error capture is being skipped).
>
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> ---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9989d121127d..e8a32a7e7daf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3978,6 +3978,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> !context_blocked(ce))) {
> capture_error_state(guc, ce);
> guc_context_replay(ce);
> + } else if (intel_context_is_banned(ce)) {
> + drm_info(&guc_to_gt(guc)->i915->drm,
> + "Reset notificaion for banned context 0x%04X on %s",
> + ce->guc_id.id, ce->engine->name);
The context being blocking isn't an error either. I think real fix is
changing the below drm_err to drm_info and call it a day.
Matt
> } else {
> drm_err(&guc_to_gt(guc)->i915->drm,
> "Invalid GuC engine reset notificaion for 0x%04X on %s: banned = %d, blocked = %d",
> --
> 2.25.1
>
More information about the Intel-gfx
mailing list