[Intel-gfx] [PATCH v2] drm/i915/guc: add CAT error handler
Andrzej Hajda
andrzej.hajda at intel.com
Fri Oct 28 12:41:07 UTC 2022
On 28.10.2022 12:06, Tvrtko Ursulin wrote:
>
> Hi,
>
> I can't really provide feedback on the GuC interactions so only some
> superficial comments below.
>
> On 28/10/2022 10:34, Andrzej Hajda wrote:
>> Bad GPU memory accesses can result in catastrophic error notifications
>> being sent from the GPU to the KMD via the GuC. Add a handler to process
>> the notification by printing a kernel message and dumping the related
>> engine state (if appropriate).
>> Since the same CAT error can be reported twice, log only the first one
>> and treat any error for the same context reported within 100ms of the
>> first one as a duplicate.
>>
>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>> ---
>> .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h | 1 +
>> drivers/gpu/drm/i915/gt/uc/intel_guc.h | 2 +
>> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 3 ++
>> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 47 +++++++++++++++++++
>> 4 files changed, 53 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index f359bef046e0b2..f9a1c5642855e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -138,6 +138,7 @@ enum intel_guc_action {
>> INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>> INTEL_GUC_ACTION_CLIENT_SOFT_RESET = 0x5507,
>> INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>> + INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR = 0x6000,
>> INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION = 0x8002,
>> INTEL_GUC_ACTION_NOTIFY_FLUSH_LOG_BUFFER_TO_FILE = 0x8003,
>> INTEL_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED = 0x8004,
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 804133df1ac9b4..61b412732d095a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -445,6 +445,8 @@ int intel_guc_engine_failure_process_msg(struct
>> intel_guc *guc,
>> const u32 *msg, u32 len);
>> int intel_guc_error_capture_process_msg(struct intel_guc *guc,
>> const u32 *msg, u32 len);
>> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
>> + const u32 *msg, u32 len);
>> struct intel_engine_cs *
>> intel_guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8
>> instance);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> index 2b22065e87bf9a..f55f724e264407 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> @@ -1035,6 +1035,9 @@ static int ct_process_request(struct
>> intel_guc_ct *ct, struct ct_incoming_msg *r
>> CT_ERROR(ct, "Received GuC exception notification!\n");
>> ret = 0;
>> break;
>> + case INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR:
>> + ret = intel_guc_cat_error_process_msg(guc, payload, len);
>> + break;
>> default:
>> ret = -EOPNOTSUPP;
>> break;
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 693b07a977893d..f68ae4a0ad864d 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4659,6 +4659,53 @@ int intel_guc_engine_failure_process_msg(struct
>> intel_guc *guc,
>> return 0;
>> }
>> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
>> + const u32 *msg, u32 len)
>> +{
>> + static struct {
>> + u32 ctx_id;
>> + unsigned long after;
>> + } ratelimit;
>> + struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
>> + struct drm_printer p = drm_info_printer(i915->drm.dev);
>> + struct intel_context *ce;
>> + unsigned long flags;
>> + u32 ctx_id;
>> +
>> + if (unlikely(len != 1)) {
>> + drm_dbg(&i915->drm, "Invalid length %u\n", len);
>> + return -EPROTO;
>> + }
>> + ctx_id = msg[0];
>> +
>> + if (ctx_id == ratelimit.ctx_id &&
>> time_is_after_jiffies(ratelimit.after))
>> + return 0;
>
> This will be suboptimal with multi-gpu and multi-tile. Not sure if
> ratelimiting is needed,
In many cases they come in pairs:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12313/bat-adln-1/igt@i915_selftest@live@hugepages.html#dmesg-warnings510
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12304/bat-rpls-2/igt@i915_selftest@live@hugepages.html#dmesg-warnings514
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12298/bat-rpls-1/igt@i915_selftest@live@hugepages.html#dmesg-warnings553
And since a CAT error usually (always???) means an engine hang, both
apparently indicate the same error.
> but if it is, then perhaps move the state into
> struct intel_guc?
Or intel_context or intel_engine_cs; then the logs would be avoided until
the engine is reset?
It is becoming complicated now, I just wanted a better log message :)
>
> Would it be worth counting the rate limited ones and then log how many
> were not logged when the next one is logged?
I have no idea why there are two messages; to me it looks like just
some redundancy. The engine state is exactly the same in both cases.
>
> Should the condition be inverted - !time_is_after?
time_is_after_jiffies(arg) means that the time pointed to by arg has not
arrived yet (arg > jiffies, i.e. it is after jiffies), so it looks OK,
but the name is misleading.
>
>> +
>> + ratelimit.ctx_id = ctx_id;
>> + ratelimit.after = jiffies + msecs_to_jiffies(100);
>> +
>> + if (unlikely(ctx_id == -1)) {
>> + drm_err(&i915->drm,
>> + "GPU reported catastrophic error without providing valid
>> context\n");
>> + return 0;
>> + }
>> +
>> + xa_lock_irqsave(&guc->context_lookup, flags);
>
> The only caller seems to be a worker so just _irq I guess.
> ct_process_incoming_requests has the same issue but I haven't looked
> into other handlers called from ct_process_request.
>
>> + ce = g2h_context_lookup(guc, ctx_id);
>> + if (ce)
>> + intel_context_get(ce);
>> + xa_unlock_irqrestore(&guc->context_lookup, flags);
>> + if (unlikely(!ce))
>> + return -EPROTO;
>
> EPROTO seems incorrect - message could have just been delayed and
> context deregistered I think. Probably you just need to still log the
> error just say context couldn't be resolved. GuC experts to confirm or
> deny.
Copy/pasted from intel_guc_context_reset_process_msg, but you are right.
On the other hand, a hung context should not fly too far away.
>
>> +
>> + drm_err(&i915->drm, "GPU reported catastrophic error associated
>> with context %u on %s\n",
>> + ctx_id, ce->engine->name);
>> + intel_engine_dump(ce->engine, &p, "%s\n", ce->engine->name);
>
> Same as above, when CT channel is congested this can be delayed and then
> I wonder what's the point of dumping engine state. In fact, even when CT
> is not congested the delay could be significant enough for it to be
> pointless. Another question for GuC experts I guess.
>
> Also, check if intel_engine_dump can handle ce->engine being a virtual
> engine.
I've added the engine dump to see at which instruction the CAT error
occurred. It did not help much in my case, but maybe it can be helpful
in other cases.
I also thought about adding:
guc_log_context(p, ce)
guc_log_context_priority(p, ce)
Or even the whole body of the loop in
intel_guc_submission_print_context_info, after exporting it to some
helper, but atm I have no idea whether it would be helpful.
Regards
Andrzej
>
> Regards,
>
> Tvrtko
>
>> + intel_context_put(ce);
>> +
>> + return 0;
>> +}
>> +
>> void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>> {
>> struct intel_guc *guc = &engine->gt->uc.guc;