[Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low level driver errors
Upadhyay, Tejas
tejas.upadhyay at intel.com
Mon Oct 16 15:03:29 UTC 2023
> -----Original Message-----
> From: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> Sent: Monday, October 16, 2023 8:05 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: Vishwanathapura at freedesktop.org; Iddamsetty, Aravind
> <aravind.iddamsetty at intel.com>
> Subject: Re: [Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low level
> driver errors
>
>
>
> On 16.10.2023 14:36, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> >> Sent: Friday, October 13, 2023 8:43 PM
> >> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Cc: Vishwanathapura at freedesktop.org
> >> Subject: Re: [Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low
> >> level driver errors
> >>
> >>
> >>
> >> On 13.10.2023 14:16, Tejas Upadhyay wrote:
> >>> we added a low level driver error counter and incrementing on each
> >>> occurrance. Focus is on errors that are not functionally
> >>
> >> typo
> >
> > Sure
> >
> >>
> >>> affecting the system and might otherwise go unnoticed and cause
> >>> power/performance regressions, so checking for the error counters
> >>> should help.
> >>>
> >>> Importantly the intention is not to go adding new error checks, but
> >>> to make sure the existing important error conditions are propagated
> >>> in terms of counter under respective categories like below :
> >>> Under GT:
> >>> gt_guc_communication,
> >>> gt_engine_other,
> >>> gt_other
> >>>
> >>> Under Tile:
> >>> gtt,
> >>> interrupt
> >>>
> >>> TODO: Currently this is just a counting of errors, later these
> >>> counters will be reported through netlink interface when it is
> >>> implemented and ready.
> >>>
> >>> V6:
> >>> - move drm_err to gt and tile specific err API - Aravind
> >>> - Use GTT naming instead of GGTT - Aravind/Niranjana
> >>> V5:
> >>> - Dump err_type in string format
> >>> V4:
> >>> - dump err_type in drm_err log - Himal
> >>> V2:
> >>> - Use modified APIs
> >>>
> >>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>> ---
> >>> drivers/gpu/drm/xe/xe_gt.c | 25 +++++++++++-
> >>> drivers/gpu/drm/xe/xe_gt.h | 4 +-
> >>> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 13 +++---
> >>> drivers/gpu/drm/xe/xe_guc.c | 15 ++++---
> >>> drivers/gpu/drm/xe/xe_guc_ct.c | 44 +++++++++++----------
> >>> drivers/gpu/drm/xe/xe_guc_pc.c | 12 +++---
> >>> drivers/gpu/drm/xe/xe_guc_submit.c | 39 ++++++++++--------
> >>> drivers/gpu/drm/xe/xe_irq.c | 6 ++-
> >>> drivers/gpu/drm/xe/xe_reg_sr.c | 22 +++++------
> >>> drivers/gpu/drm/xe/xe_tile.c | 26 +++++++++++-
> >>> drivers/gpu/drm/xe/xe_tile.h | 4 +-
> >>> 11 files changed, 139 insertions(+), 71 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> >>> index b81de3984626..6f875b3886fd 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt.c
> >>> +++ b/drivers/gpu/drm/xe/xe_gt.c
> >>> @@ -47,10 +47,18 @@
> >>> #include "xe_wa.h"
> >>> #include "xe_wopcm.h"
> >>>
> >>> +const char *const xe_gt_drv_err_to_str[] = {
> >>> + [XE_GT_DRV_ERR_GUC_COMM] = "GUC COMMUNICATION",
> >>> + [XE_GT_DRV_ERR_ENGINE] = "ENGINE OTHER",
> >>> + [XE_GT_DRV_ERR_OTHERS] = "GT OTHER"
> >>> +};
> >>
> >> instead of exposing table, better BKM is to define public helper
> >> function that converts enum to string, and keep such table as
> >> implementation detail
> >
> > It always better and suggested upstream approach to statically bind values
> in array like this above. Also having so many switch cases degrades code
> execution. So in my opinion what we have above should work well.
>
> but you can still use table as map from enum to string my point is that
> exposing a raw table (without guard code) may lead to bugs, while
> encapsulating that inside a helper will be more robust
>
> also we don't expect these errors (and thus err strings) to happen frequently,
> so function call cost could be ignored
>
> >
> >>
> >>> +
> >>> /**
> >>> * xe_gt_report_driver_error - Count driver error for gt
> >>> * @gt: GT to count error for
> >>> * @err: enum error type
> >>> + * @fmt: debug message format to print error
> >>> + * @...: variable args to print error
> >>> *
> >>> * Increment the driver error counter in respective error
> >>> * category for this GT.
> >>> @@ -58,11 +66,26 @@
> >>> * Returns void.
> >>> */
> >>> void xe_gt_report_driver_error(struct xe_gt *gt,
> >>> - const enum xe_gt_drv_err_type err)
> >>> + const enum xe_gt_drv_err_type err,
> >>> + const char *fmt, ...)
> >>> {
> >>> + struct va_format vaf;
> >>> + va_list args;
> >>> +
> >>> xe_gt_assert(gt, err >= ARRAY_SIZE(gt->drv_err_cnt));
> >>> WRITE_ONCE(gt->drv_err_cnt[err],
> >>> READ_ONCE(gt->drv_err_cnt[err]) + 1);
> >>> +
> >>> + va_start(args, fmt);
> >>> + vaf.fmt = fmt;
> >>> + vaf.va = &args;
> >>> +
> >>> + BUILD_BUG_ON(ARRAY_SIZE(xe_gt_drv_err_to_str) !=
> >>> + XE_GT_DRV_ERR_MAX);
> >>> +
> >>> + drm_err(>_to_xe(gt)->drm, "GT%u [%s] %pV",
> >>> + gt->info.id, xe_gt_drv_err_to_str[err], &vaf);
> >>
> >> why not using existing xe_gt_err() which already adds "GT%u" prefix?
> >
> > Ok makes sense I will use xe_gt_err for GT error reporting.
> >
> >>
> >> also if you want to include error type string here, then maybe its
> >> worth to consider their shorter names ?
> >>
> >> "GUC COMMUNICATION" -> "GUC"
> >> "ENGINE OTHER" -> "ENGINE"
> >> "GT OTHER" -> "OTHER"
> >
> > No the names gives information that its GUC communication not GUC
> loading failures that we are going to report here. Similar for engine we have
> reset failures which we are not reporting instead we report other engine
> performance related errors. Similarly for GT no reset errors, only GT perf
> errors. So naming here is meaningful.
> >
> >>
> >> as otherwise they will be quite long:
> >>
> >> [ ] xe 0000:b7:00.0: [drm] *ERROR* GT0: [GT OTHER] ...
> >> [ ] xe 0000:b7:00.0: [drm] *ERROR* GT0: [GUC COMMUNICATION] ...
> >>
> >>> + va_end(args);
> >>> }
> >>>
> >>> struct xe_gt *xe_gt_alloc(struct xe_tile *tile) diff --git
> >>> a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h index
> >>> 9442d615042f..374c6946620b 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt.h
> >>> +++ b/drivers/gpu/drm/xe/xe_gt.h
> >>> @@ -67,7 +67,9 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt
> >>> *gt,
> >> struct xe_hw_engine *hwe)
> >>> hwe->instance == gt->usm.reserved_bcs_instance; }
> >>>
> >>> +extern const char *const xe_gt_drv_err_to_str[];
> >>> void xe_gt_report_driver_error(struct xe_gt *gt,
> >>> - const enum xe_gt_drv_err_type err);
> >>> + const enum xe_gt_drv_err_type err,
> >>> + const char *fmt, ...);
> >>>
> >>> #endif
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>> b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>> index bd6005b9d498..770ed7df8389 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>> @@ -9,6 +9,7 @@
> >>> #include "xe_gt.h"
> >>> #include "xe_guc.h"
> >>> #include "xe_guc_ct.h"
> >>> +#include "xe_tile.h"
> >>> #include "xe_trace.h"
> >>>
> >>> #define TLB_TIMEOUT (HZ / 4)
> >>> @@ -35,8 +36,10 @@ static void xe_gt_tlb_fence_timeout(struct
> >> work_struct *work)
> >>> break;
> >>>
> >>> trace_xe_gt_tlb_invalidation_fence_timeout(fence);
> >>> - drm_err(>_to_xe(gt)->drm, "gt%d: TLB invalidation fence
> >> timeout, seqno=%d recv=%d",
> >>> - gt->info.id, fence->seqno, gt-
> >>> tlb_invalidation.seqno_recv);
> >>> + xe_tile_report_driver_error(gt_to_tile(gt),
> >> XE_TILE_DRV_ERR_GTT,
> >>> + "gt%d: TLB invalidation fence
> >> timeout, seqno=%d recv=%d",
> >>
> >> hmm, TLB invalidation code seems to be operating on the GT/GuC level,
> >> and timeouts here will have origin in requests sent (but lost or
> >> something) to GuC, so why suddenly we report this as a tile error ?
> >>
> >> note that this will also make message inconsistent:
> >> gt%u here vs GT%u elsewhere
> >
> > GGTT is per tile and PPGTT is per GT. I will make sure to see if further
> categorization of GGTT vs PPGTT needed while reporting error here.
> >
> >>
> >>> + gt->info.id, fence->seqno,
> >>> + gt->tlb_invalidation.seqno_recv);
> >>>
> >>> list_del(&fence->link);
> >>> fence->base.error = -ETIME;
> >>> @@ -317,7 +320,6 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
> >>> */
> >>> int xe_gt_tlb_invalidation_wait(struct xe_gt *gt, int seqno) {
> >>> - struct xe_device *xe = gt_to_xe(gt);
> >>> struct xe_guc *guc = >->uc.guc;
> >>> int ret;
> >>>
> >>> @@ -329,8 +331,9 @@ int xe_gt_tlb_invalidation_wait(struct xe_gt
> >>> *gt, int
> >> seqno)
> >>> tlb_invalidation_seqno_past(gt, seqno),
> >>> TLB_TIMEOUT);
> >>> if (!ret) {
> >>> - drm_err(&xe->drm, "gt%d: TLB invalidation time'd out,
> >> seqno=%d, recv=%d\n",
> >>> - gt->info.id, seqno, gt->tlb_invalidation.seqno_recv);
> >>> + xe_tile_report_driver_error(gt_to_tile(gt),
> >> XE_TILE_DRV_ERR_GTT,
> >>> + "gt%d: TLB invalidation time'd out,
> >> seqno=%d, recv=%d\n",
> >>> + gt->info.id, seqno, gt-
> >>> tlb_invalidation.seqno_recv);
> >>> return -ETIME;
> >>> }
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_guc.c
> >>> b/drivers/gpu/drm/xe/xe_guc.c index 84f0b5488783..c40d0cc46293
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_guc.c
> >>> +++ b/drivers/gpu/drm/xe/xe_guc.c
> >>> @@ -663,8 +663,9 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc,
> >> const u32 *request,
> >>> 50000, &reply, false);
> >>> if (ret) {
> >>> timeout:
> >>> - drm_err(&xe->drm, "mmio request %#x: no reply %#x\n",
> >>> - request[0], reply);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_GUC_COMM,
> >>> + "mmio request %#x: no reply
> >> %#x\n",
> >>> + request[0], reply);
> >>
> >> maybe it's worth to add helper:
> >>
> >> #define xe_guc_report_error(guc, fmt...) \
> >> xe_gt_report_driver_error( \
> >> guc_to_gt(guc), \
> >> XE_GT_DRV_ERR_GUC_COMM, fmt)
> >
> > I think its not needed, anyway we will call xe_gt_report_driver_error under
> it. Also if there will be more such categories like "guc" then we will keep on
> adding helpers.
> >
> >>
> >>> return ret;
> >>> }
> >>>
> >>> @@ -697,16 +698,18 @@ int xe_guc_mmio_send_recv(struct xe_guc
> *guc,
> >> const u32 *request,
> >>> u32 hint = FIELD_GET(GUC_HXG_FAILURE_MSG_0_HINT,
> >> header);
> >>> u32 error = FIELD_GET(GUC_HXG_FAILURE_MSG_0_ERROR,
> >> header);
> >>>
> >>> - drm_err(&xe->drm, "mmio request %#x: failure %#x/%#x\n",
> >>> - request[0], error, hint);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_GUC_COMM,
> >>> + "mmio request %#x: failure
> >> %#x/%#x\n",
> >>> + request[0], error, hint);
> >>> return -ENXIO;
> >>> }
> >>>
> >>> if (FIELD_GET(GUC_HXG_MSG_0_TYPE, header) !=
> >>> GUC_HXG_TYPE_RESPONSE_SUCCESS) {
> >>> proto:
> >>> - drm_err(&xe->drm, "mmio request %#x: unexpected reply
> >> %#x\n",
> >>> - request[0], header);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_GUC_COMM,
> >>> + "mmio request %#x: unexpected
> >> reply %#x\n",
> >>> + request[0], header);
> >>> return -EPROTO;
> >>> }
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> >>> b/drivers/gpu/drm/xe/xe_guc_ct.c index 8b686c8b3339..1615f4e3c01e
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> >>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> >>> @@ -732,8 +732,9 @@ static int guc_ct_send_recv(struct xe_guc_ct
> >>> *ct, const u32 *action, u32 len,
> >>>
> >>> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> >>> if (!ret) {
> >>> - drm_err(&xe->drm, "Timed out wait for G2H, fence %u,
> >> action %04x",
> >>> - g2h_fence.seqno, action[0]);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Timed out wait for G2H, fence %u,
> >> action %04x",
> >>> + g2h_fence.seqno, action[0]);
> >>> xa_erase_irq(&ct->fence_lookup, g2h_fence.seqno);
> >>> return -ETIME;
> >>> }
> >>> @@ -744,8 +745,9 @@ static int guc_ct_send_recv(struct xe_guc_ct
> >>> *ct,
> >> const u32 *action, u32 len,
> >>> goto retry;
> >>> }
> >>> if (g2h_fence.fail) {
> >>> - drm_err(&xe->drm, "Send failed, action 0x%04x, error %d,
> >> hint %d",
> >>> - action[0], g2h_fence.error, g2h_fence.hint);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Send failed, action 0x%04x, error
> >> %d, hint %d",
> >>> + action[0], g2h_fence.error,
> >> g2h_fence.hint);
> >>> ret = -EIO;
> >>> }
> >>>
> >>> @@ -829,7 +831,6 @@ static int parse_g2h_response(struct xe_guc_ct
> >>> *ct, u32 *msg, u32 len)
> >>>
> >>> static int parse_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len) {
> >>> - struct xe_device *xe = ct_to_xe(ct);
> >>> u32 hxg, origin, type;
> >>> int ret;
> >>>
> >>> @@ -839,9 +840,9 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>> u32 *msg, u32 len)
> >>>
> >>> origin = FIELD_GET(GUC_HXG_MSG_0_ORIGIN, hxg);
> >>> if (unlikely(origin != GUC_HXG_ORIGIN_GUC)) {
> >>> - drm_err(&xe->drm,
> >>> - "G2H channel broken on read, origin=%d, reset
> >> required\n",
> >>> - origin);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "G2H channel broken on read,
> >> origin=%d, reset required\n",
> >>> + origin);
> >>> ct->ctbs.g2h.info.broken = true;
> >>>
> >>> return -EPROTO;
> >>> @@ -858,9 +859,9 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>> u32
> >> *msg, u32 len)
> >>> ret = parse_g2h_response(ct, msg, len);
> >>> break;
> >>> default:
> >>> - drm_err(&xe->drm,
> >>> - "G2H channel broken on read, type=%d, reset
> >> required\n",
> >>> - type);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "G2H channel broken on read,
> >> type=%d, reset required\n",
> >>> + type);
> >>> ct->ctbs.g2h.info.broken = true;
> >>>
> >>> ret = -EOPNOTSUPP;
> >>> @@ -871,7 +872,6 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>> u32 *msg, u32 len)
> >>>
> >>> static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
> >>> {
> >>> - struct xe_device *xe = ct_to_xe(ct);
> >>> struct xe_guc *guc = ct_to_guc(ct);
> >>> u32 action = FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, msg[1]);
> >>> u32 *payload = msg + GUC_CTB_HXG_MSG_MIN_LEN; @@ -918,12
> >> +918,15 @@
> >>> static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
> >>> adj_len);
> >>> break;
> >>> default:
> >>> - drm_err(&xe->drm, "unexpected action 0x%04x\n", action);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "unexpected action 0x%04x\n",
> >>> + action);
> >>> }
> >>>
> >>> if (ret)
> >>> - drm_err(&xe->drm, "action 0x%04x failed processing,
> >> ret=%d\n",
> >>> - action, ret);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "action 0x%04x failed processing,
> >> ret=%d\n",
> >>> + action, ret);
> >>>
> >>> return 0;
> >>> }
> >>> @@ -957,9 +960,9 @@ static int g2h_read(struct xe_guc_ct *ct, u32
> >>> *msg,
> >> bool fast_path)
> >>> sizeof(u32));
> >>> len = FIELD_GET(GUC_CTB_MSG_0_NUM_DWORDS, msg[0]) +
> >> GUC_CTB_MSG_MIN_LEN;
> >>> if (len > avail) {
> >>> - drm_err(&xe->drm,
> >>> - "G2H channel broken on read, avail=%d, len=%d,
> >> reset required\n",
> >>> - avail, len);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "G2H channel broken on read,
> >> avail=%d, len=%d, reset required\n",
> >>> + avail, len);
> >>> g2h->info.broken = true;
> >>>
> >>> return -EPROTO;
> >>> @@ -1027,8 +1030,9 @@ static void g2h_fast_path(struct xe_guc_ct
> >>> *ct, u32
> >> *msg, u32 len)
> >>> }
> >>>
> >>> if (ret)
> >>> - drm_err(&xe->drm, "action 0x%04x failed processing,
> >> ret=%d\n",
> >>> - action, ret);
> >>> + xe_gt_report_driver_error(ct_to_gt(ct),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "action 0x%04x failed processing,
> >> ret=%d\n",
> >>> + action, ret);
> >>> }
> >>>
> >>> /**
> >>> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c
> >>> b/drivers/gpu/drm/xe/xe_guc_pc.c index d9375d1d582f..1ee7b68a9fc4
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> >>> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> >>> @@ -197,9 +197,9 @@ static int pc_action_query_task_state(struct
> >> xe_guc_pc *pc)
> >>> /* Blocking here to ensure the results are ready before reading
> >>> them
> >> */
> >>> ret = xe_guc_ct_send_block(ct, action, ARRAY_SIZE(action));
> >>> if (ret)
> >>> - drm_err(&pc_to_xe(pc)->drm,
> >>> - "GuC PC query task state failed: %pe", ERR_PTR(ret));
> >>> -
> >>> + xe_gt_report_driver_error(pc_to_gt(pc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "GuC PC query task state failed:
> >> %pe",
> >>> + ERR_PTR(ret));
> >>
> >> are we sure we want to count same problem several times?
> >>
> >> note that xe_guc_ct_send_block() will likely already report that as
> >> "GUC COMM" error, should we do it again here ? or maybe this should
> >> be different category/class ?
> >
> > I will double check if double counting is done here. Ideally I would make
> sure that 1 counting happens at end for 1 path of error.
> >
> >>
> >>> return ret;
> >>> }
> >>>
> >>> @@ -219,9 +219,9 @@ static int pc_action_set_param(struct xe_guc_pc
> >>> *pc, u8 id, u32 value)
> >>>
> >>> ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> >>> if (ret)
> >>> - drm_err(&pc_to_xe(pc)->drm, "GuC PC set param failed:
> >> %pe",
> >>> - ERR_PTR(ret));
> >>> -
> >>> + xe_gt_report_driver_error(pc_to_gt(pc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "GuC PC set param failed: %pe",
> >>> + ERR_PTR(ret));
> >>> return ret;
> >>> }
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> >>> b/drivers/gpu/drm/xe/xe_guc_submit.c
> >>> index 870dc5c532fa..4b7fe6419d48 100644
> >>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> >>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> >>> @@ -1497,13 +1497,15 @@ g2h_exec_queue_lookup(struct xe_guc *guc,
> >> u32 guc_id)
> >>> struct xe_exec_queue *q;
> >>>
> >>> if (unlikely(guc_id >= GUC_ID_MAX)) {
> >>> - drm_err(&xe->drm, "Invalid guc_id %u", guc_id);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid guc_id %u", guc_id);
> >>> return NULL;
> >>> }
> >>>
> >>> q = xa_load(&guc->submission_state.exec_queue_lookup, guc_id);
> >>> if (unlikely(!q)) {
> >>> - drm_err(&xe->drm, "Not engine present for guc_id %u",
> >> guc_id);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Not engine present for guc_id %u",
> >> guc_id);
> >>> return NULL;
> >>> }
> >>>
> >>> @@ -1527,12 +1529,12 @@ static void deregister_exec_queue(struct
> >>> xe_guc *guc, struct xe_exec_queue *q)
> >>>
> >>> int xe_guc_sched_done_handler(struct xe_guc *guc, u32 *msg, u32
> >>> len) {
> >>> - struct xe_device *xe = guc_to_xe(guc);
> >>> struct xe_exec_queue *q;
> >>> u32 guc_id = msg[0];
> >>>
> >>> if (unlikely(len < 2)) {
> >>> - drm_err(&xe->drm, "Invalid length %u", len);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid length %u", len);
> >>
> >> nit: IMO such error messages are redundant as CTB will report failed
> >> action anyway (including message dump, where invalid length could be
> >> spotted)
> >
> > Can you please elaborate more here please?
>
> ok, so it looks that CTB implementation in Xe is less verbose than on i915,
> where all errors were correctly propagated and related message was included
> in the log
>
> on Xe this seems to be partially broken, as dequeue_one_g2h() always
> returns 0 while g2h_worker_func() is prepared to kick reset on EPROTO:
>
> g2h_worker_func
> dequeue_one_g2h
> process_g2h_msg(msg, len)
> xe_guc_sched_done_handler(msg, len)
> drm_err(..."Invalid length
> return -EPROTO
> if (ret)
> drm_err("action 0x%04x failed processing, ret=%d\n",
> return 0;
>
> if (ret == -EPROTO...) {
> xe_guc_ct_print(ct, &p, false);
> kick_reset(ct);
>
> Note than once EPROTO errors will be correctly propagated, then it will trigger
> a reset, thus it will conflict with your statement above that this error tracking
> does include only perf errors.
>
> "Similarly for GT no reset errors, only GT perf errors"
In that case I need to remove reporting and counting errors here. I will double check this.
>
> >
> >>
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1542,8 +1544,9 @@ int xe_guc_sched_done_handler(struct xe_guc
> >>> *guc, u32 *msg, u32 len)
> >>>
> >>> if (unlikely(!exec_queue_pending_enable(q) &&
> >>> !exec_queue_pending_disable(q))) {
> >>> - drm_err(&xe->drm, "Unexpected engine state 0x%04x",
> >>> - atomic_read(&q->guc->state));
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Unexpected engine state 0x%04x",
> >>> + atomic_read(&q->guc->state));
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1572,12 +1575,12 @@ int xe_guc_sched_done_handler(struct
> xe_guc
> >>> *guc, u32 *msg, u32 len)
> >>>
> >>> int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg,
> >>> u32
> >>> len) {
> >>> - struct xe_device *xe = guc_to_xe(guc);
> >>> struct xe_exec_queue *q;
> >>> u32 guc_id = msg[0];
> >>>
> >>> if (unlikely(len < 1)) {
> >>> - drm_err(&xe->drm, "Invalid length %u", len);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid length %u", len);
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1587,8 +1590,9 @@ int xe_guc_deregister_done_handler(struct
> >> xe_guc
> >>> *guc, u32 *msg, u32 len)
> >>>
> >>> if (!exec_queue_destroyed(q) || exec_queue_pending_disable(q) ||
> >>> exec_queue_pending_enable(q) || exec_queue_enabled(q)) {
> >>> - drm_err(&xe->drm, "Unexpected engine state 0x%04x",
> >>> - atomic_read(&q->guc->state));
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Unexpected engine state 0x%04x",
> >>> + atomic_read(&q->guc->state));
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1611,7 +1615,8 @@ int xe_guc_exec_queue_reset_handler(struct
> >> xe_guc *guc, u32 *msg, u32 len)
> >>> u32 guc_id = msg[0];
> >>>
> >>> if (unlikely(len < 1)) {
> >>> - drm_err(&xe->drm, "Invalid length %u", len);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid length %u", len);
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1646,7 +1651,8 @@ int
> >> xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32
> >> *msg,
> >>> u32 guc_id = msg[0];
> >>>
> >>> if (unlikely(len < 1)) {
> >>> - drm_err(&xe->drm, "Invalid length %u", len);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid length %u", len);
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1667,12 +1673,12 @@ int
> >>> xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32
> >>> *msg,
> >>>
> >>> int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32
> >>> *msg, u32 len) {
> >>> - struct xe_device *xe = guc_to_xe(guc);
> >>> u8 guc_class, instance;
> >>> u32 reason;
> >>>
> >>> if (unlikely(len != 3)) {
> >>> - drm_err(&xe->drm, "Invalid length %u", len);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc),
> >> XE_GT_DRV_ERR_GUC_COMM,
> >>> + "Invalid length %u", len);
> >>> return -EPROTO;
> >>> }
> >>>
> >>> @@ -1681,8 +1687,9 @@ int
> >> xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32 *msg,
> >> u32 le
> >>> reason = msg[2];
> >>>
> >>> /* Unexpected failure of a hardware feature, log an actual error */
> >>> - drm_err(&xe->drm, "GuC engine reset request failed on %d:%d
> >> because 0x%08X",
> >>> - guc_class, instance, reason);
> >>> + xe_gt_report_driver_error(guc_to_gt(guc), XE_GT_DRV_ERR_ENGINE,
> >>> + "GuC engine reset request failed on %d:%d
> >> because 0x%08X",
> >>> + guc_class, instance, reason);
> >>>
> >>> xe_gt_reset_async(guc_to_gt(guc));
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_irq.c
> >>> b/drivers/gpu/drm/xe/xe_irq.c index def9369eb488..2d00baf2c127
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_irq.c
> >>> +++ b/drivers/gpu/drm/xe/xe_irq.c
> >>> @@ -18,6 +18,7 @@
> >>> #include "xe_guc.h"
> >>> #include "xe_hw_engine.h"
> >>> #include "xe_mmio.h"
> >>> +#include "xe_tile.h"
> >>>
> >>> /*
> >>> * Interrupt registers for a unit are always consecutive and
> >>> ordered @@ -222,8 +223,9 @@ gt_engine_identity(struct xe_device *xe,
> >>> !time_after32(local_clock() >> 10, timeout_ts));
> >>>
> >>> if (unlikely(!(ident & INTR_DATA_VALID))) {
> >>> - drm_err(&xe->drm, "INTR_IDENTITY_REG%u:%u 0x%08x not
> >> valid!\n",
> >>> - bank, bit, ident);
> >>> + xe_tile_report_driver_error(gt_to_tile(mmio),
> >> XE_TILE_DRV_ERR_INTR,
> >>> + "INTR_IDENTITY_REG%u:%u
> >> 0x%08x not valid!\n",
> >>> + bank, bit, ident);
> >>> return 0;
> >>> }
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_reg_sr.c
> >>> b/drivers/gpu/drm/xe/xe_reg_sr.c index 87adefb56024..9ee62ff209eb
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_reg_sr.c
> >>> +++ b/drivers/gpu/drm/xe/xe_reg_sr.c
> >>> @@ -125,12 +125,12 @@ int xe_reg_sr_add(struct xe_reg_sr *sr,
> >>> return 0;
> >>>
> >>> fail:
> >>> - xe_gt_err(gt,
> >>> - "discarding save-restore reg %04lx (clear: %08x, set: %08x,
> >> masked: %s, mcr: %s): ret=%d\n",
> >>> - idx, e->clr_bits, e->set_bits,
> >>> - str_yes_no(e->reg.masked),
> >>> - str_yes_no(e->reg.mcr),
> >>> - ret);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS,
> >>> + "discarding save-restore reg %04lx (clear:
> >> %08x, set: %08x, masked: %s, mcr: %s): ret=%d\n",
> >>> + idx, e->clr_bits, e->set_bits,
> >>> + str_yes_no(e->reg.masked),
> >>> + str_yes_no(e->reg.mcr),
> >>> + ret);
> >>> reg_sr_inc_error(sr);
> >>>
> >>> return ret;
> >>> @@ -207,7 +207,7 @@ void xe_reg_sr_apply_mmio(struct xe_reg_sr *sr,
> >> struct xe_gt *gt)
> >>> return;
> >>>
> >>> err_force_wake:
> >>> - xe_gt_err(gt, "Failed to apply, err=%d\n", err);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS, "Failed to
> >>> +apply, err=%d\n", err);
> >>> }
> >>>
> >>> void xe_reg_sr_apply_whitelist(struct xe_hw_engine *hwe) @@ -234,9
> >>> +234,9 @@ void xe_reg_sr_apply_whitelist(struct xe_hw_engine *hwe)
> >>> p = drm_debug_printer(KBUILD_MODNAME);
> >>> xa_for_each(&sr->xa, reg, entry) {
> >>> if (slot == RING_MAX_NONPRIV_SLOTS) {
> >>> - xe_gt_err(gt,
> >>> - "hwe %s: maximum register whitelist slots
> >> (%d) reached, refusing to add more\n",
> >>> - hwe->name, RING_MAX_NONPRIV_SLOTS);
> >>> + xe_gt_report_driver_error(gt,
> >> XE_GT_DRV_ERR_ENGINE,
> >>> + "hwe %s: maximum register
> >> whitelist slots (%d) reached, refusing to add more\n",
> >>> + hwe->name,
> >> RING_MAX_NONPRIV_SLOTS);
> >>> break;
> >>> }
> >>>
> >>> @@ -259,7 +259,7 @@ void xe_reg_sr_apply_whitelist(struct
> >>> xe_hw_engine
> >> *hwe)
> >>> return;
> >>>
> >>> err_force_wake:
> >>> - drm_err(&xe->drm, "Failed to apply, err=%d\n", err);
> >>> + xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS, "Failed to
> >>> +apply, err=%d\n", err);
> >>> }
> >>>
> >>> /**
> >>> diff --git a/drivers/gpu/drm/xe/xe_tile.c
> >>> b/drivers/gpu/drm/xe/xe_tile.c index 708dd385f2b1..0f0ec8f33a51
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_tile.c
> >>> +++ b/drivers/gpu/drm/xe/xe_tile.c
> >>> @@ -71,10 +71,17 @@
> >>> * - MOCS and PAT programming
> >>> */
> >>>
> >>> +const char *const xe_tile_drv_err_to_str[] = {
> >>> + [XE_TILE_DRV_ERR_GTT] = "GTT",
> >>> + [XE_TILE_DRV_ERR_INTR] = "INTERRUPT"
> >>
> >> "IRQ" ?
> >
> > These are furious reported interrupts errors which does not have handlers,
> so naming is correct.
> >
> >>
> >>> +};
> >>> +
> >>> /**
> >>> * xe_tile_report_driver_error - Count driver error for tile
> >>> * @tile: Tile to count error for
> >>> - * @err: enum error type
> >>> + * @err: Enum error type
> >>> + * @fmt: debug message format to print error
> >>> + * @...: variable args to print error
> >>> *
> >>> * Increment the driver error counter in respective error
> >>> * category for this tile.
> >>> @@ -82,11 +89,26 @@
> >>> * Returns void.
> >
> > I will check the documentation.
> >
> >>
> >> nit: is this a proper way to document "void" functions ?
> >>
> >>> */
> >>> void xe_tile_report_driver_error(struct xe_tile *tile,
> >>> - const enum xe_tile_drv_err_type err)
> >>> + const enum xe_tile_drv_err_type err,
> >>> + const char *fmt, ...)
> >>> {
> >>> + struct va_format vaf;
> >>> + va_list args;
> >>> +
> >>> xe_assert(tile_to_xe(tile), err >= ARRAY_SIZE(tile->drv_err_cnt));
> >>> WRITE_ONCE(tile->drv_err_cnt[err],
> >>> READ_ONCE(tile->drv_err_cnt[err]) + 1);
> >>> +
> >>> + va_start(args, fmt);
> >>> + vaf.fmt = fmt;
> >>> + vaf.va = &args;
> >>> +
> >>> + BUILD_BUG_ON(ARRAY_SIZE(xe_tile_drv_err_to_str) !=
> >>> + XE_TILE_DRV_ERR_MAX);
> >>
> >> move to xe_tile_driver_error_to_string() instead
> >
> > Like said above, such API is not needed.
>
> so maybe at least make this xe_tile_drv_err_to_str[] static ?
> it doesn't look to be used outside this function
Ok, looks like now we can make it static. Thanks I will do that.
Thanks,
Tejas
>
> Michal
>
> >
> > Thanks,
> > Tejas
> >>
> >>> +
> >>> + drm_err(&tile->xe->drm, "TILE%u [%s] %pV",
> >>> + tile->id, xe_tile_drv_err_to_str[err], &vaf);
> >>> + va_end(args);
> >>> }
> >>>
> >>> /**
> >>> diff --git a/drivers/gpu/drm/xe/xe_tile.h
> >>> b/drivers/gpu/drm/xe/xe_tile.h index 092a6b17a97e..1e23acdb9c81
> >>> 100644
> >>> --- a/drivers/gpu/drm/xe/xe_tile.h
> >>> +++ b/drivers/gpu/drm/xe/xe_tile.h
> >>> @@ -14,7 +14,9 @@ int xe_tile_alloc(struct xe_tile *tile); int
> >>> xe_tile_init_noalloc(struct xe_tile *tile);
> >>>
> >>> void xe_tile_migrate_wait(struct xe_tile *tile);
> >>> +extern const char *const xe_tile_drv_err_to_str[];
> >>
> >> don't expose it, hide in xe_tile_driver_error_to_string() if needed
> >>
> >>> void xe_tile_report_driver_error(struct xe_tile *tile,
> >>> - const enum xe_tile_drv_err_type err);
> >>> + const enum xe_tile_drv_err_type err,
> >>> + const char *fmt, ...);
> >>>
> >>> #endif
More information about the Intel-xe
mailing list