[Intel-gfx] [PATCH v2] drm/i915/guc: add CAT error handler
Andrzej Hajda
andrzej.hajda at intel.com
Fri Oct 28 12:41:07 UTC 2022
On 28.10.2022 12:06, Tvrtko Ursulin wrote:
>
> Hi,
>
> I can't really provide feedback on the GuC interactions so only some
> superficial comments below.
>
> On 28/10/2022 10:34, Andrzej Hajda wrote:
>> Bad GPU memory accesses can result in catastrophic error notifications
>> being sent from the GPU to the KMD via the GuC. Add a handler to process
>> the notification by printing a kernel message and dumping the related
>> engine state (if appropriate).
>> Since the same CAT error can be reported twice, log only the first one
>> and treat any error for the same context reported within 100ms of the
>> first one as a duplicate.
>>
>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>> ---
>> .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h | 1 +
>> drivers/gpu/drm/i915/gt/uc/intel_guc.h | 2 +
>> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 3 ++
>> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 47 +++++++++++++++++++
>> 4 files changed, 53 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index f359bef046e0b2..f9a1c5642855e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -138,6 +138,7 @@ enum intel_guc_action {
>> INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>> INTEL_GUC_ACTION_CLIENT_SOFT_RESET = 0x5507,
>> INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>> + INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR = 0x6000,
>> INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION = 0x8002,
>> INTEL_GUC_ACTION_NOTIFY_FLUSH_LOG_BUFFER_TO_FILE = 0x8003,
>> INTEL_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED = 0x8004,
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 804133df1ac9b4..61b412732d095a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -445,6 +445,8 @@ int intel_guc_engine_failure_process_msg(struct
>> intel_guc *guc,
>> const u32 *msg, u32 len);
>> int intel_guc_error_capture_process_msg(struct intel_guc *guc,
>> const u32 *msg, u32 len);
>> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
>> + const u32 *msg, u32 len);
>> struct intel_engine_cs *
>> intel_guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8
>> instance);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> index 2b22065e87bf9a..f55f724e264407 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>> @@ -1035,6 +1035,9 @@ static int ct_process_request(struct
>> intel_guc_ct *ct, struct ct_incoming_msg *r
>> CT_ERROR(ct, "Received GuC exception notification!\n");
>> ret = 0;
>> break;
>> + case INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR:
>> + ret = intel_guc_cat_error_process_msg(guc, payload, len);
>> + break;
>> default:
>> ret = -EOPNOTSUPP;
>> break;
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 693b07a977893d..f68ae4a0ad864d 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4659,6 +4659,53 @@ int intel_guc_engine_failure_process_msg(struct
>> intel_guc *guc,
>> return 0;
>> }
>> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
>> + const u32 *msg, u32 len)
>> +{
>> + static struct {
>> + u32 ctx_id;
>> + unsigned long after;
>> + } ratelimit;
>> + struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
>> + struct drm_printer p = drm_info_printer(i915->drm.dev);
>> + struct intel_context *ce;
>> + unsigned long flags;
>> + u32 ctx_id;
>> +
>> + if (unlikely(len != 1)) {
>> + drm_dbg(&i915->drm, "Invalid length %u\n", len);
>> + return -EPROTO;
>> + }
>> + ctx_id = msg[0];
>> +
>> + if (ctx_id == ratelimit.ctx_id &&
>> time_is_after_jiffies(ratelimit.after))
>> + return 0;
>
> This will be suboptimal with multi-gpu and multi-tile. Not sure if
> ratelimiting is needed,
In many cases they come in pairs:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12313/bat-adln-1/igt@i915_selftest@live@hugepages.html#dmesg-warnings510
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12304/bat-rpls-2/igt@i915_selftest@live@hugepages.html#dmesg-warnings514
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12298/bat-rpls-1/igt@i915_selftest@live@hugepages.html#dmesg-warnings553
And since a CAT error usually (always???) means an engine hang, both
apparently indicate the same error.
> but if it is, then perhaps move the state into
> struct intel_guc?
Or intel_context or intel_engine_cs; then the logs would be avoided until
the engine is reset?
It is becoming complicated now, I just wanted a better log message :)
>
> Would it be worth counting the rate limited ones and then log how many
> were not logged when the next one is logged?
I have no idea why there are two messages; to me it looks like just
some redundancy. The engine state is exactly the same in both cases.
>
> Should the condition be inverted - !time_is_after?
time_is_after_jiffies(arg) means that the time pointed to by arg has not
arrived yet (arg > jiffies, i.e. it is after jiffies), so it looks OK,
but the name is misleading.
>
>> +
>> + ratelimit.ctx_id = ctx_id;
>> + ratelimit.after = jiffies + msecs_to_jiffies(100);
>> +
>> + if (unlikely(ctx_id == -1)) {
>> + drm_err(&i915->drm,
>> + "GPU reported catastrophic error without providing valid
>> context\n");
>> + return 0;
>> + }
>> +
>> + xa_lock_irqsave(&guc->context_lookup, flags);
>
> The only caller seems to be a worker so just _irq I guess.
> ct_process_incoming_requests has the same issue but I haven't looked
> into other handlers called from ct_process_request.
>
>> + ce = g2h_context_lookup(guc, ctx_id);
>> + if (ce)
>> + intel_context_get(ce);
>> + xa_unlock_irqrestore(&guc->context_lookup, flags);
>> + if (unlikely(!ce))
>> + return -EPROTO;
>
> EPROTO seems incorrect - message could have just been delayed and
> context deregistered I think. Probably you just need to still log the
> error just say context couldn't be resolved. GuC experts to confirm or
> deny.
Copy/pasted from intel_guc_context_reset_process_msg, but you are right.
On the other hand, a hung context should not fly too far away.
>
>> +
>> + drm_err(&i915->drm, "GPU reported catastrophic error associated
>> with context %u on %s\n",
>> + ctx_id, ce->engine->name);
>> + intel_engine_dump(ce->engine, &p, "%s\n", ce->engine->name);
>
> Same as above, when CT channel is congested this can be delayed and then
> I wonder what's the point of dumping engine state. In fact, even when CT
> is not congested the delay could be significant enough for it to be
> pointless. Another question for GuC experts I guess.
>
> Also, check if intel_engine_dump can handle ce->engine being a virtual
> engine.
I've added the engine dump to see at which instruction the CAT error
occurred. It did not help much in my case, but maybe it can be helpful
in other cases.
I also thought about adding:
guc_log_context(p, ce)
guc_log_context_priority(p, ce)
Or even the whole body of the loop in
intel_guc_submission_print_context_info, after exporting it to some
helper, but atm I have no idea whether it would be helpful.
Regards
Andrzej
>
> Regards,
>
> Tvrtko
>
>> + intel_context_put(ce);
>> +
>> + return 0;
>> +}
>> +
>> void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>> {
>> struct intel_guc *guc = &engine->gt->uc.guc;