[Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low level driver errors

Wed Oct 18 04:57:44 UTC 2023

> -----Original Message-----
> From: Iddamsetty, Aravind <aravind.iddamsetty at intel.com>
> Sent: Tuesday, October 17, 2023 7:37 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; Wajdeczko, Michal
> <Michal.Wajdeczko at intel.com>; intel-xe at lists.freedesktop.org
> Cc: Vishwanathapura at freedesktop.org
> Subject: Re: [Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low level
> driver errors
> 
> 
> 
> On 17-10-2023 18:10, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> >> Sent: Monday, October 16, 2023 8:05 PM
> >> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Cc: Vishwanathapura at freedesktop.org; Iddamsetty, Aravind
> >> <aravind.iddamsetty at intel.com>
> >> Subject: Re: [Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for low
> >> level driver errors
> >>
> >>
> >>
> >> On 16.10.2023 14:36, Upadhyay, Tejas wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> >>>> Sent: Friday, October 13, 2023 8:43 PM
> >>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> >>>> xe at lists.freedesktop.org
> >>>> Cc: Vishwanathapura at freedesktop.org
> >>>> Subject: Re: [Intel-xe] [PATCH V6 2/2] drm/xe: Update counter for
> >>>> low level driver errors
> >>>>
> >>>>
> >>>>
> >>>> On 13.10.2023 14:16, Tejas Upadhyay wrote:
> >>>>> we added a low level driver error counter and incrementing on each
> >>>>> occurrance. Focus is on errors that are not functionally
> >>>>
> >>>> typo
> >>>
> >>> Sure
> >>>
> >>>>
> >>>>> affecting the system and might otherwise go unnoticed and cause
> >>>>> power/performance regressions, so checking for the error counters
> >>>>> should help.
> >>>>>
> >>>>> Importantly the intention is not to go adding new error checks,
> >>>>> but to make sure the existing important error conditions are
> >>>>> propagated in terms of counter under respective categories like below :
> >>>>> Under GT:
> >>>>> gt_guc_communication,
> >>>>> gt_engine_other,
> >>>>> gt_other
> >>>>>
> >>>>> Under Tile:
> >>>>> gtt,
> >>>>> interrupt
> >>>>>
> >>>>> TODO: Currently this is just a counting of errors, later these
> >>>>> counters will be reported through netlink interface when it is
> >>>>> implemented and ready.
> >>>>>
> >>>>> V6:
> >>>>>   - move drm_err to gt and tile specific err API - Aravind
> >>>>>   - Use GTT naming instead of GGTT - Aravind/Niranjana
> >>>>> V5:
> >>>>>   - Dump err_type in string format
> >>>>> V4:
> >>>>>   - dump err_type in drm_err log - Himal
> >>>>> V2:
> >>>>>   - Use modified APIs
> >>>>>
> >>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>>> ---
> >>>>>  drivers/gpu/drm/xe/xe_gt.c                  | 25 +++++++++++-
> >>>>>  drivers/gpu/drm/xe/xe_gt.h                  |  4 +-
> >>>>>  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 13 +++---
> >>>>>  drivers/gpu/drm/xe/xe_guc.c                 | 15 ++++---
> >>>>>  drivers/gpu/drm/xe/xe_guc_ct.c              | 44 +++++++++++----------
> >>>>>  drivers/gpu/drm/xe/xe_guc_pc.c              | 12 +++---
> >>>>>  drivers/gpu/drm/xe/xe_guc_submit.c          | 39 ++++++++++--------
> >>>>>  drivers/gpu/drm/xe/xe_irq.c                 |  6 ++-
> >>>>>  drivers/gpu/drm/xe/xe_reg_sr.c              | 22 +++++------
> >>>>>  drivers/gpu/drm/xe/xe_tile.c                | 26 +++++++++++-
> >>>>>  drivers/gpu/drm/xe/xe_tile.h                |  4 +-
> >>>>>  11 files changed, 139 insertions(+), 71 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c
> >>>>> b/drivers/gpu/drm/xe/xe_gt.c index b81de3984626..6f875b3886fd
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_gt.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c
> >>>>> @@ -47,10 +47,18 @@
> >>>>>  #include "xe_wa.h"
> >>>>>  #include "xe_wopcm.h"
> >>>>>
> >>>>> +const char *const xe_gt_drv_err_to_str[] = {
> >>>>> +	[XE_GT_DRV_ERR_GUC_COMM] = "GUC COMMUNICATION",
> >>>>> +	[XE_GT_DRV_ERR_ENGINE] = "ENGINE OTHER",
> >>>>> +	[XE_GT_DRV_ERR_OTHERS] = "GT OTHER"
> >>>>> +};
> >>>>
> >>>> instead of exposing table, better BKM is to define public helper
> >>>> function that converts enum to string, and keep such table as
> >>>> implementation detail
> >>>
> >>> It always better and suggested upstream approach to statically bind
> >>> values
> >> in array like this above. Also having so many switch cases degrades
> >> code execution. So in my opinion what we have above should work well.
> >>
> >> but you can still use table as map from enum to string my point is
> >> that exposing a raw table (without guard code) may lead to bugs,
> >> while encapsulating that inside a helper will be more robust
> >>
> >> also we don't expect these errors (and thus err strings) to happen
> >> frequently, so function call cost could be ignored
> >>
> >>>
> >>>>
> >>>>> +
> >>>>>  /**
> >>>>>   * xe_gt_report_driver_error - Count driver error for gt
> >>>>>   * @gt: GT to count error for
> >>>>>   * @err: enum error type
> >>>>> + * @fmt: debug message format to print error
> >>>>> + * @...: variable args to print error
> >>>>>   *
> >>>>>   * Increment the driver error counter in respective error
> >>>>>   * category for this GT.
> >>>>> @@ -58,11 +66,26 @@
> >>>>>   * Returns void.
> >>>>>   */
> >>>>>  void xe_gt_report_driver_error(struct xe_gt *gt,
> >>>>> -			       const enum xe_gt_drv_err_type err)
> >>>>> +			       const enum xe_gt_drv_err_type err,
> >>>>> +			       const char *fmt, ...)
> >>>>>  {
> >>>>> +	struct va_format vaf;
> >>>>> +	va_list args;
> >>>>> +
> >>>>>  	xe_gt_assert(gt, err >= ARRAY_SIZE(gt->drv_err_cnt));
> >>>>>  	WRITE_ONCE(gt->drv_err_cnt[err],
> >>>>>  		   READ_ONCE(gt->drv_err_cnt[err]) + 1);
> >>>>> +
> >>>>> +	va_start(args, fmt);
> >>>>> +	vaf.fmt = fmt;
> >>>>> +	vaf.va = &args;
> >>>>> +
> >>>>> +	BUILD_BUG_ON(ARRAY_SIZE(xe_gt_drv_err_to_str) !=
> >>>>> +		     XE_GT_DRV_ERR_MAX);
> >>>>> +
> >>>>> +	drm_err(&gt_to_xe(gt)->drm, "GT%u [%s] %pV",
> >>>>> +		gt->info.id, xe_gt_drv_err_to_str[err], &vaf);
> >>>>
> >>>> why not using existing xe_gt_err() which already adds "GT%u" prefix?
> >>>
> >>> Ok makes sense I will use xe_gt_err for GT error reporting.
> >>>
> >>>>
> >>>> also if you want to include error type string here, then maybe its
> >>>> worth to consider their shorter names ?
> >>>>
> >>>> 	"GUC COMMUNICATION" 	-> "GUC"
> >>>> 	"ENGINE OTHER"		-> "ENGINE"
> >>>> 	"GT OTHER"		-> "OTHER"
> >>>
> >>> No the names gives information that its GUC communication not GUC
> >> loading failures that we are going to report here. Similar for engine
> >> we have reset failures which we are not reporting instead we report
> >> other engine performance related errors. Similarly for GT no reset
> >> errors, only GT perf errors. So naming here is meaningful.
> >>>
> >>>>
> >>>> as otherwise they will be quite long:
> >>>>
> >>>> [ ] xe 0000:b7:00.0: [drm] *ERROR* GT0: [GT OTHER] ...
> >>>> [ ] xe 0000:b7:00.0: [drm] *ERROR* GT0: [GUC COMMUNICATION] ...
> >>>>
> >>>>> +	va_end(args);
> >>>>>  }
> >>>>>
> >>>>>  struct xe_gt *xe_gt_alloc(struct xe_tile *tile) diff --git
> >>>>> a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h index
> >>>>> 9442d615042f..374c6946620b 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_gt.h
> >>>>> +++ b/drivers/gpu/drm/xe/xe_gt.h
> >>>>> @@ -67,7 +67,9 @@ static inline bool xe_gt_is_usm_hwe(struct xe_gt
> >>>>> *gt,
> >>>> struct xe_hw_engine *hwe)
> >>>>>  		hwe->instance == gt->usm.reserved_bcs_instance;  }
> >>>>>
> >>>>> +extern const char *const xe_gt_drv_err_to_str[];
> >>>>>  void xe_gt_report_driver_error(struct xe_gt *gt,
> >>>>> -			       const enum xe_gt_drv_err_type err);
> >>>>> +			       const enum xe_gt_drv_err_type err,
> >>>>> +			       const char *fmt, ...);
> >>>>>
> >>>>>  #endif
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>>>> b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>>>> index bd6005b9d498..770ed7df8389 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> >>>>> @@ -9,6 +9,7 @@
> >>>>>  #include "xe_gt.h"
> >>>>>  #include "xe_guc.h"
> >>>>>  #include "xe_guc_ct.h"
> >>>>> +#include "xe_tile.h"
> >>>>>  #include "xe_trace.h"
> >>>>>
> >>>>>  #define TLB_TIMEOUT	(HZ / 4)
> >>>>> @@ -35,8 +36,10 @@ static void xe_gt_tlb_fence_timeout(struct
> >>>> work_struct *work)
> >>>>>  			break;
> >>>>>
> >>>>>  		trace_xe_gt_tlb_invalidation_fence_timeout(fence);
> >>>>> -		drm_err(&gt_to_xe(gt)->drm, "gt%d: TLB invalidation fence
> >>>> timeout, seqno=%d recv=%d",
> >>>>> -			gt->info.id, fence->seqno, gt-
> >>>>> tlb_invalidation.seqno_recv);
> >>>>> +		xe_tile_report_driver_error(gt_to_tile(gt),
> >>>> XE_TILE_DRV_ERR_GTT,
> >>>>> +					    "gt%d: TLB invalidation
> fence
> >>>> timeout, seqno=%d recv=%d",
> >>>>
> >>>> hmm, TLB invalidation code seems to be operating on the GT/GuC
> >>>> level, and timeouts here will have origin in requests sent (but
> >>>> lost or
> >>>> something) to GuC, so why suddenly we report this as a tile error ?
> >>>>
> >>>> note that this will also make message inconsistent:
> >>>> 	gt%u here vs GT%u elsewhere
> >>>
> >>> GGTT  is per tile and PPGTT is per GT. I will make sure to see if
> >>> further
> >> categorization of GGTT vs PPGTT needed while reporting error here.
> >
> > So I looked into this little in details, we come to this error
> > reporting from two places, 1. xe_gt_init --> xe_gt_tlb_invalidation_init -->
> xe_gt_tlb_fence_timeout, so this could be per GT and considered as PPGTT.
> > 2. xe_vm_invalidate_vma --> xe_gt_tlb_invalidation_wait --> this is getting
> called per tile and called from GGTT so can be considered GGTT.
> 
> both the above calls
> (xe_gt_tlb_fence_timeout/xe_gt_tlb_invalidation_wait) can happen for PPGTT
> or GGTT which you cannot identify at this level unless you identify that in the
> caller and pass that info to these.
> 
> Also since the TLB invalidation is issued to a GUC which is per GT atleast these
> errors shall be counted at GT level.

Ok, so looks like in that case we should let this stay as GTT only.

Thanks,
Tejas 
> 
> Thanks,
> Aravind.
> >
> > We can consider this categorization further.
> >
> > Thanks,
> > Tejas
> >>>
> >>>>
> >>>>> +					    gt->info.id, fence->seqno,
> >>>>> +					    gt-
> >tlb_invalidation.seqno_recv);
> >>>>>
> >>>>>  		list_del(&fence->link);
> >>>>>  		fence->base.error = -ETIME;
> >>>>> @@ -317,7 +320,6 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
> >>>>>   */
> >>>>>  int xe_gt_tlb_invalidation_wait(struct xe_gt *gt, int seqno)  {
> >>>>> -	struct xe_device *xe = gt_to_xe(gt);
> >>>>>  	struct xe_guc *guc = &gt->uc.guc;
> >>>>>  	int ret;
> >>>>>
> >>>>> @@ -329,8 +331,9 @@ int xe_gt_tlb_invalidation_wait(struct xe_gt
> >>>>> *gt, int
> >>>> seqno)
> >>>>>  				 tlb_invalidation_seqno_past(gt, seqno),
> >>>>>  				 TLB_TIMEOUT);
> >>>>>  	if (!ret) {
> >>>>> -		drm_err(&xe->drm, "gt%d: TLB invalidation time'd out,
> >>>> seqno=%d, recv=%d\n",
> >>>>> -			gt->info.id, seqno, gt->tlb_invalidation.seqno_recv);
> >>>>> +		xe_tile_report_driver_error(gt_to_tile(gt),
> >>>> XE_TILE_DRV_ERR_GTT,
> >>>>> +					    "gt%d: TLB invalidation
> time'd out,
> >>>> seqno=%d, recv=%d\n",
> >>>>> +					    gt->info.id, seqno, gt-
> >>>>> tlb_invalidation.seqno_recv);
> >>>>>  		return -ETIME;
> >>>>>  	}
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c
> >>>>> b/drivers/gpu/drm/xe/xe_guc.c index 84f0b5488783..c40d0cc46293
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
> >>>>> @@ -663,8 +663,9 @@ int xe_guc_mmio_send_recv(struct xe_guc
> *guc,
> >>>> const u32 *request,
> >>>>>  			     50000, &reply, false);
> >>>>>  	if (ret) {
> >>>>>  timeout:
> >>>>> -		drm_err(&xe->drm, "mmio request %#x: no reply %#x\n",
> >>>>> -			request[0], reply);
> >>>>> +		xe_gt_report_driver_error(gt,
> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "mmio request %#x: no reply
> >>>> %#x\n",
> >>>>> +					  request[0], reply);
> >>>>
> >>>> maybe it's worth to add helper:
> >>>>
> >>>> #define xe_guc_report_error(guc, fmt...) 	\
> >>>> 	xe_gt_report_driver_error( 		\
> >>>> 		guc_to_gt(guc),			\
> >>>> 		XE_GT_DRV_ERR_GUC_COMM, fmt)
> >>>
> >>> I think its not needed, anyway we will call
> >>> xe_gt_report_driver_error under
> >> it. Also if there will be more such categories like "guc" then we
> >> will keep on adding helpers.
> >>>
> >>>>
> >>>>>  		return ret;
> >>>>>  	}
> >>>>>
> >>>>> @@ -697,16 +698,18 @@ int xe_guc_mmio_send_recv(struct xe_guc
> >> *guc,
> >>>> const u32 *request,
> >>>>>  		u32 hint = FIELD_GET(GUC_HXG_FAILURE_MSG_0_HINT,
> >>>> header);
> >>>>>  		u32 error = FIELD_GET(GUC_HXG_FAILURE_MSG_0_ERROR,
> >>>> header);
> >>>>>
> >>>>> -		drm_err(&xe->drm, "mmio request %#x: failure %#x/%#x\n",
> >>>>> -			request[0], error, hint);
> >>>>> +		xe_gt_report_driver_error(gt,
> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "mmio request %#x: failure
> >>>> %#x/%#x\n",
> >>>>> +					  request[0], error, hint);
> >>>>>  		return -ENXIO;
> >>>>>  	}
> >>>>>
> >>>>>  	if (FIELD_GET(GUC_HXG_MSG_0_TYPE, header) !=
> >>>>>  	    GUC_HXG_TYPE_RESPONSE_SUCCESS) {
> >>>>>  proto:
> >>>>> -		drm_err(&xe->drm, "mmio request %#x: unexpected reply
> >>>> %#x\n",
> >>>>> -			request[0], header);
> >>>>> +		xe_gt_report_driver_error(gt,
> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "mmio request %#x:
> unexpected
> >>>> reply %#x\n",
> >>>>> +					  request[0], header);
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> >>>>> b/drivers/gpu/drm/xe/xe_guc_ct.c index 8b686c8b3339..1615f4e3c01e
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> >>>>> @@ -732,8 +732,9 @@ static int guc_ct_send_recv(struct xe_guc_ct
> >>>>> *ct, const u32 *action, u32 len,
> >>>>>
> >>>>>  	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> >>>>>  	if (!ret) {
> >>>>> -		drm_err(&xe->drm, "Timed out wait for G2H, fence %u,
> >>>> action %04x",
> >>>>> -			g2h_fence.seqno, action[0]);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Timed out wait for G2H,
> fence %u,
> >>>> action %04x",
> >>>>> +					  g2h_fence.seqno, action[0]);
> >>>>>  		xa_erase_irq(&ct->fence_lookup, g2h_fence.seqno);
> >>>>>  		return -ETIME;
> >>>>>  	}
> >>>>> @@ -744,8 +745,9 @@ static int guc_ct_send_recv(struct xe_guc_ct
> >>>>> *ct,
> >>>> const u32 *action, u32 len,
> >>>>>  		goto retry;
> >>>>>  	}
> >>>>>  	if (g2h_fence.fail) {
> >>>>> -		drm_err(&xe->drm, "Send failed, action 0x%04x, error %d,
> >>>> hint %d",
> >>>>> -			action[0], g2h_fence.error, g2h_fence.hint);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Send failed, action 0x%04x,
> error
> >>>> %d, hint %d",
> >>>>> +					  action[0], g2h_fence.error,
> >>>> g2h_fence.hint);
> >>>>>  		ret = -EIO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -829,7 +831,6 @@ static int parse_g2h_response(struct xe_guc_ct
> >>>>> *ct, u32 *msg, u32 len)
> >>>>>
> >>>>>  static int parse_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)  {
> >>>>> -	struct xe_device *xe = ct_to_xe(ct);
> >>>>>  	u32 hxg, origin, type;
> >>>>>  	int ret;
> >>>>>
> >>>>> @@ -839,9 +840,9 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>>>> u32 *msg, u32 len)
> >>>>>
> >>>>>  	origin = FIELD_GET(GUC_HXG_MSG_0_ORIGIN, hxg);
> >>>>>  	if (unlikely(origin != GUC_HXG_ORIGIN_GUC)) {
> >>>>> -		drm_err(&xe->drm,
> >>>>> -			"G2H channel broken on read, origin=%d, reset
> >>>> required\n",
> >>>>> -			origin);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "G2H channel broken on
> read,
> >>>> origin=%d, reset required\n",
> >>>>> +					  origin);
> >>>>>  		ct->ctbs.g2h.info.broken = true;
> >>>>>
> >>>>>  		return -EPROTO;
> >>>>> @@ -858,9 +859,9 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>>>> u32
> >>>> *msg, u32 len)
> >>>>>  		ret = parse_g2h_response(ct, msg, len);
> >>>>>  		break;
> >>>>>  	default:
> >>>>> -		drm_err(&xe->drm,
> >>>>> -			"G2H channel broken on read, type=%d, reset
> >>>> required\n",
> >>>>> -			type);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "G2H channel broken on
> read,
> >>>> type=%d, reset required\n",
> >>>>> +					  type);
> >>>>>  		ct->ctbs.g2h.info.broken = true;
> >>>>>
> >>>>>  		ret = -EOPNOTSUPP;
> >>>>> @@ -871,7 +872,6 @@ static int parse_g2h_msg(struct xe_guc_ct *ct,
> >>>>> u32 *msg, u32 len)
> >>>>>
> >>>>>  static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32
> >>>>> len) {
> >>>>> -	struct xe_device *xe = ct_to_xe(ct);
> >>>>>  	struct xe_guc *guc = ct_to_guc(ct);
> >>>>>  	u32 action = FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, msg[1]);
> >>>>>  	u32 *payload = msg + GUC_CTB_HXG_MSG_MIN_LEN; @@ -918,12
> >>>> +918,15 @@
> >>>>> static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
> >>>>>  							   adj_len);
> >>>>>  		break;
> >>>>>  	default:
> >>>>> -		drm_err(&xe->drm, "unexpected action 0x%04x\n", action);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "unexpected action
> 0x%04x\n",
> >>>>> +					  action);
> >>>>>  	}
> >>>>>
> >>>>>  	if (ret)
> >>>>> -		drm_err(&xe->drm, "action 0x%04x failed processing,
> >>>> ret=%d\n",
> >>>>> -			action, ret);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "action 0x%04x failed
> processing,
> >>>> ret=%d\n",
> >>>>> +					  action, ret);
> >>>>>
> >>>>>  	return 0;
> >>>>>  }
> >>>>> @@ -957,9 +960,9 @@ static int g2h_read(struct xe_guc_ct *ct, u32
> >>>>> *msg,
> >>>> bool fast_path)
> >>>>>  			   sizeof(u32));
> >>>>>  	len = FIELD_GET(GUC_CTB_MSG_0_NUM_DWORDS, msg[0]) +
> >>>> GUC_CTB_MSG_MIN_LEN;
> >>>>>  	if (len > avail) {
> >>>>> -		drm_err(&xe->drm,
> >>>>> -			"G2H channel broken on read, avail=%d, len=%d,
> >>>> reset required\n",
> >>>>> -			avail, len);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "G2H channel broken on
> read,
> >>>> avail=%d, len=%d, reset required\n",
> >>>>> +					  avail, len);
> >>>>>  		g2h->info.broken = true;
> >>>>>
> >>>>>  		return -EPROTO;
> >>>>> @@ -1027,8 +1030,9 @@ static void g2h_fast_path(struct xe_guc_ct
> >>>>> *ct, u32
> >>>> *msg, u32 len)
> >>>>>  	}
> >>>>>
> >>>>>  	if (ret)
> >>>>> -		drm_err(&xe->drm, "action 0x%04x failed processing,
> >>>> ret=%d\n",
> >>>>> -			action, ret);
> >>>>> +		xe_gt_report_driver_error(ct_to_gt(ct),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "action 0x%04x failed
> processing,
> >>>> ret=%d\n",
> >>>>> +					  action, ret);
> >>>>>  }
> >>>>>
> >>>>>  /**
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c
> >>>>> b/drivers/gpu/drm/xe/xe_guc_pc.c index d9375d1d582f..1ee7b68a9fc4
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
> >>>>> @@ -197,9 +197,9 @@ static int pc_action_query_task_state(struct
> >>>> xe_guc_pc *pc)
> >>>>>  	/* Blocking here to ensure the results are ready before reading
> >>>>> them
> >>>> */
> >>>>>  	ret = xe_guc_ct_send_block(ct, action, ARRAY_SIZE(action));
> >>>>>  	if (ret)
> >>>>> -		drm_err(&pc_to_xe(pc)->drm,
> >>>>> -			"GuC PC query task state failed: %pe", ERR_PTR(ret));
> >>>>> -
> >>>>> +		xe_gt_report_driver_error(pc_to_gt(pc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "GuC PC query task state
> failed:
> >>>> %pe",
> >>>>> +					  ERR_PTR(ret));
> >>>>
> >>>> are we sure we want to count same problem several times?
> >>>>
> >>>> note that xe_guc_ct_send_block() will likely already report that as
> >>>> "GUC COMM" error, should we do it again here ? or maybe this should
> >>>> be different category/class ?
> >>>
> >>> I will double check if double counting is done here. Ideally I would
> >>> make
> >> sure that 1 counting happens at end for 1 path of error.
> >>>
> >>>>
> >>>>>  	return ret;
> >>>>>  }
> >>>>>
> >>>>> @@ -219,9 +219,9 @@ static int pc_action_set_param(struct
> >>>>> xe_guc_pc *pc, u8 id, u32 value)
> >>>>>
> >>>>>  	ret = xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> >>>>>  	if (ret)
> >>>>> -		drm_err(&pc_to_xe(pc)->drm, "GuC PC set param failed:
> >>>> %pe",
> >>>>> -			ERR_PTR(ret));
> >>>>> -
> >>>>> +		xe_gt_report_driver_error(pc_to_gt(pc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "GuC PC set param failed:
> %pe",
> >>>>> +					  ERR_PTR(ret));
> >>>>>  	return ret;
> >>>>>  }
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> >>>>> b/drivers/gpu/drm/xe/xe_guc_submit.c
> >>>>> index 870dc5c532fa..4b7fe6419d48 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> >>>>> @@ -1497,13 +1497,15 @@ g2h_exec_queue_lookup(struct xe_guc
> *guc,
> >>>> u32 guc_id)
> >>>>>  	struct xe_exec_queue *q;
> >>>>>
> >>>>>  	if (unlikely(guc_id >= GUC_ID_MAX)) {
> >>>>> -		drm_err(&xe->drm, "Invalid guc_id %u", guc_id);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid guc_id %u", guc_id);
> >>>>>  		return NULL;
> >>>>>  	}
> >>>>>
> >>>>>  	q = xa_load(&guc->submission_state.exec_queue_lookup, guc_id);
> >>>>>  	if (unlikely(!q)) {
> >>>>> -		drm_err(&xe->drm, "Not engine present for guc_id %u",
> >>>> guc_id);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Not engine present for
> guc_id %u",
> >>>> guc_id);
> >>>>>  		return NULL;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1527,12 +1529,12 @@ static void deregister_exec_queue(struct
> >>>>> xe_guc *guc, struct xe_exec_queue *q)
> >>>>>
> >>>>>  int xe_guc_sched_done_handler(struct xe_guc *guc, u32 *msg, u32
> >>>>> len) {
> >>>>> -	struct xe_device *xe = guc_to_xe(guc);
> >>>>>  	struct xe_exec_queue *q;
> >>>>>  	u32 guc_id = msg[0];
> >>>>>
> >>>>>  	if (unlikely(len < 2)) {
> >>>>> -		drm_err(&xe->drm, "Invalid length %u", len);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid length %u", len);
> >>>>
> >>>> nit: IMO such error messages are redundant as CTB will report
> >>>> failed action anyway (including message dump, where invalid length
> >>>> could be
> >>>> spotted)
> >>>
> >>> Can you please elaborate more here please?
> >>
> >> ok, so it looks that CTB implementation in Xe is less verbose than on
> >> i915, where all errors were correctly propagated and related message
> >> was included in the log
> >>
> >> on Xe this seems to be partially broken, as dequeue_one_g2h() always
> >> returns 0 while g2h_worker_func() is prepared to kick reset on EPROTO:
> >>
> >> g2h_worker_func
> >>   dequeue_one_g2h
> >>     process_g2h_msg(msg, len)
> >>       xe_guc_sched_done_handler(msg, len)
> >>         drm_err(..."Invalid length
> >>         return -EPROTO
> >>     if (ret)
> >>       drm_err("action 0x%04x failed processing, ret=%d\n",
> >>     return 0;
> >>
> >> if (ret == -EPROTO...) {
> >>   xe_guc_ct_print(ct, &p, false);
> >>   kick_reset(ct);
> >>
> >> Note than once EPROTO errors will be correctly propagated, then it
> >> will trigger a reset, thus it will conflict with your statement above
> >> that this error tracking does include only perf errors.
> >>
> >> "Similarly for GT no reset errors, only GT perf errors"
> >>
> >>>
> >>>>
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1542,8 +1544,9 @@ int xe_guc_sched_done_handler(struct
> xe_guc
> >>>>> *guc, u32 *msg, u32 len)
> >>>>>
> >>>>>  	if (unlikely(!exec_queue_pending_enable(q) &&
> >>>>>  		     !exec_queue_pending_disable(q))) {
> >>>>> -		drm_err(&xe->drm, "Unexpected engine state 0x%04x",
> >>>>> -			atomic_read(&q->guc->state));
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Unexpected engine state
> 0x%04x",
> >>>>> +					  atomic_read(&q->guc-
> >state));
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1572,12 +1575,12 @@ int xe_guc_sched_done_handler(struct
> >> xe_guc
> >>>>> *guc, u32 *msg, u32 len)
> >>>>>
> >>>>>  int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg,
> >>>>> u32
> >>>>> len)  {
> >>>>> -	struct xe_device *xe = guc_to_xe(guc);
> >>>>>  	struct xe_exec_queue *q;
> >>>>>  	u32 guc_id = msg[0];
> >>>>>
> >>>>>  	if (unlikely(len < 1)) {
> >>>>> -		drm_err(&xe->drm, "Invalid length %u", len);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid length %u", len);
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1587,8 +1590,9 @@ int xe_guc_deregister_done_handler(struct
> >>>> xe_guc
> >>>>> *guc, u32 *msg, u32 len)
> >>>>>
> >>>>>  	if (!exec_queue_destroyed(q) || exec_queue_pending_disable(q) ||
> >>>>>  	    exec_queue_pending_enable(q) || exec_queue_enabled(q)) {
> >>>>> -		drm_err(&xe->drm, "Unexpected engine state 0x%04x",
> >>>>> -			atomic_read(&q->guc->state));
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Unexpected engine state
> 0x%04x",
> >>>>> +					  atomic_read(&q->guc-
> >state));
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1611,7 +1615,8 @@ int xe_guc_exec_queue_reset_handler(struct
> >>>> xe_guc *guc, u32 *msg, u32 len)
> >>>>>  	u32 guc_id = msg[0];
> >>>>>
> >>>>>  	if (unlikely(len < 1)) {
> >>>>> -		drm_err(&xe->drm, "Invalid length %u", len);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid length %u", len);
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1646,7 +1651,8 @@ int
> >>>> xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc,
> u32
> >>>> *msg,
> >>>>>  	u32 guc_id = msg[0];
> >>>>>
> >>>>>  	if (unlikely(len < 1)) {
> >>>>> -		drm_err(&xe->drm, "Invalid length %u", len);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid length %u", len);
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1667,12 +1673,12 @@ int
> >>>>> xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc,
> u32
> >>>>> *msg,
> >>>>>
> >>>>>  int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc,
> >>>>> u32 *msg, u32 len)  {
> >>>>> -	struct xe_device *xe = guc_to_xe(guc);
> >>>>>  	u8 guc_class, instance;
> >>>>>  	u32 reason;
> >>>>>
> >>>>>  	if (unlikely(len != 3)) {
> >>>>> -		drm_err(&xe->drm, "Invalid length %u", len);
> >>>>> +		xe_gt_report_driver_error(guc_to_gt(guc),
> >>>> XE_GT_DRV_ERR_GUC_COMM,
> >>>>> +					  "Invalid length %u", len);
> >>>>>  		return -EPROTO;
> >>>>>  	}
> >>>>>
> >>>>> @@ -1681,8 +1687,9 @@ int
> >>>> xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32
> >>>> *msg,
> >>>> u32 le
> >>>>>  	reason = msg[2];
> >>>>>
> >>>>>  	/* Unexpected failure of a hardware feature, log an actual error */
> >>>>> -	drm_err(&xe->drm, "GuC engine reset request failed on %d:%d
> >>>> because 0x%08X",
> >>>>> -		guc_class, instance, reason);
> >>>>> +	xe_gt_report_driver_error(guc_to_gt(guc),
> XE_GT_DRV_ERR_ENGINE,
> >>>>> +				  "GuC engine reset request failed on
> %d:%d
> >>>> because 0x%08X",
> >>>>> +				  guc_class, instance, reason);
> >>>>>
> >>>>>  	xe_gt_reset_async(guc_to_gt(guc));
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c
> >>>>> b/drivers/gpu/drm/xe/xe_irq.c index def9369eb488..2d00baf2c127
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_irq.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_irq.c
> >>>>> @@ -18,6 +18,7 @@
> >>>>>  #include "xe_guc.h"
> >>>>>  #include "xe_hw_engine.h"
> >>>>>  #include "xe_mmio.h"
> >>>>> +#include "xe_tile.h"
> >>>>>
> >>>>>  /*
> >>>>>   * Interrupt registers for a unit are always consecutive and
> >>>>> ordered @@ -222,8 +223,9 @@ gt_engine_identity(struct xe_device
> *xe,
> >>>>>  		 !time_after32(local_clock() >> 10, timeout_ts));
> >>>>>
> >>>>>  	if (unlikely(!(ident & INTR_DATA_VALID))) {
> >>>>> -		drm_err(&xe->drm, "INTR_IDENTITY_REG%u:%u 0x%08x not
> >>>> valid!\n",
> >>>>> -			bank, bit, ident);
> >>>>> +		xe_tile_report_driver_error(gt_to_tile(mmio),
> >>>> XE_TILE_DRV_ERR_INTR,
> >>>>> +
> "INTR_IDENTITY_REG%u:%u
> >>>> 0x%08x not valid!\n",
> >>>>> +					    bank, bit, ident);
> >>>>>  		return 0;
> >>>>>  	}
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_reg_sr.c
> >>>>> b/drivers/gpu/drm/xe/xe_reg_sr.c index 87adefb56024..9ee62ff209eb
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_reg_sr.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_reg_sr.c
> >>>>> @@ -125,12 +125,12 @@ int xe_reg_sr_add(struct xe_reg_sr *sr,
> >>>>>  	return 0;
> >>>>>
> >>>>>  fail:
> >>>>> -	xe_gt_err(gt,
> >>>>> -		  "discarding save-restore reg %04lx (clear: %08x, set: %08x,
> >>>> masked: %s, mcr: %s): ret=%d\n",
> >>>>> -		  idx, e->clr_bits, e->set_bits,
> >>>>> -		  str_yes_no(e->reg.masked),
> >>>>> -		  str_yes_no(e->reg.mcr),
> >>>>> -		  ret);
> >>>>> +	xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS,
> >>>>> +				  "discarding save-restore reg %04lx
> (clear:
> >>>> %08x, set: %08x, masked: %s, mcr: %s): ret=%d\n",
> >>>>> +				  idx, e->clr_bits, e->set_bits,
> >>>>> +				  str_yes_no(e->reg.masked),
> >>>>> +				  str_yes_no(e->reg.mcr),
> >>>>> +				  ret);
> >>>>>  	reg_sr_inc_error(sr);
> >>>>>
> >>>>>  	return ret;
> >>>>> @@ -207,7 +207,7 @@ void xe_reg_sr_apply_mmio(struct xe_reg_sr
> >>>>> *sr,
> >>>> struct xe_gt *gt)
> >>>>>  	return;
> >>>>>
> >>>>>  err_force_wake:
> >>>>> -	xe_gt_err(gt, "Failed to apply, err=%d\n", err);
> >>>>> +	xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS,
> "Failed to
> >>>>> +apply, err=%d\n", err);
> >>>>>  }
> >>>>>
> >>>>>  void xe_reg_sr_apply_whitelist(struct xe_hw_engine *hwe) @@
> >>>>> -234,9
> >>>>> +234,9 @@ void xe_reg_sr_apply_whitelist(struct xe_hw_engine *hwe)
> >>>>>  	p = drm_debug_printer(KBUILD_MODNAME);
> >>>>>  	xa_for_each(&sr->xa, reg, entry) {
> >>>>>  		if (slot == RING_MAX_NONPRIV_SLOTS) {
> >>>>> -			xe_gt_err(gt,
> >>>>> -				  "hwe %s: maximum register whitelist slots
> >>>> (%d) reached, refusing to add more\n",
> >>>>> -				  hwe->name, RING_MAX_NONPRIV_SLOTS);
> >>>>> +			xe_gt_report_driver_error(gt,
> >>>> XE_GT_DRV_ERR_ENGINE,
> >>>>> +						  "hwe %s: maximum
> register
> >>>> whitelist slots (%d) reached, refusing to add more\n",
> >>>>> +						  hwe->name,
> >>>> RING_MAX_NONPRIV_SLOTS);
> >>>>>  			break;
> >>>>>  		}
> >>>>>
> >>>>> @@ -259,7 +259,7 @@ void xe_reg_sr_apply_whitelist(struct
> >>>>> xe_hw_engine
> >>>> *hwe)
> >>>>>  	return;
> >>>>>
> >>>>>  err_force_wake:
> >>>>> -	drm_err(&xe->drm, "Failed to apply, err=%d\n", err);
> >>>>> +	xe_gt_report_driver_error(gt, XE_GT_DRV_ERR_OTHERS,
> "Failed to
> >>>>> +apply, err=%d\n", err);
> >>>>>  }
> >>>>>
> >>>>>  /**
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_tile.c
> >>>>> b/drivers/gpu/drm/xe/xe_tile.c index 708dd385f2b1..0f0ec8f33a51
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_tile.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_tile.c
> >>>>> @@ -71,10 +71,17 @@
> >>>>>   *  - MOCS and PAT programming
> >>>>>   */
> >>>>>
> >>>>> +const char *const xe_tile_drv_err_to_str[] = {
> >>>>> +	[XE_TILE_DRV_ERR_GTT] = "GTT",
> >>>>> +	[XE_TILE_DRV_ERR_INTR] = "INTERRUPT"
> >>>>
> >>>> "IRQ" ?
> >>>
> >>> These are furious reported interrupts errors which does not have
> >>> handlers,
> >> so naming is correct.
> >>>
> >>>>
> >>>>> +};
> >>>>> +
> >>>>>  /**
> >>>>>   * xe_tile_report_driver_error - Count driver error for tile
> >>>>>   * @tile: Tile to count error for
> >>>>> - * @err: enum error type
> >>>>> + * @err: Enum error type
> >>>>> + * @fmt: debug message format to print error
> >>>>> + * @...: variable args to print error
> >>>>>   *
> >>>>>   * Increment the driver error counter in respective error
> >>>>>   * category for this tile.
> >>>>> @@ -82,11 +89,26 @@
> >>>>>   * Returns void.
> >>>
> >>> I will check the documentation.
> >>>
> >>>>
> >>>> nit: is this a proper way to document "void" functions ?
> >>>>
> >>>>>   */
> >>>>>  void xe_tile_report_driver_error(struct xe_tile *tile,
> >>>>> -				 const enum xe_tile_drv_err_type err)
> >>>>> +				 const enum xe_tile_drv_err_type err,
> >>>>> +				 const char *fmt, ...)
> >>>>>  {
> >>>>> +	struct va_format vaf;
> >>>>> +	va_list args;
> >>>>> +
> >>>>>  	xe_assert(tile_to_xe(tile), err >= ARRAY_SIZE(tile->drv_err_cnt));
> >>>>>  	WRITE_ONCE(tile->drv_err_cnt[err],
> >>>>>  		   READ_ONCE(tile->drv_err_cnt[err]) + 1);
> >>>>> +
> >>>>> +	va_start(args, fmt);
> >>>>> +	vaf.fmt = fmt;
> >>>>> +	vaf.va = &args;
> >>>>> +
> >>>>> +	BUILD_BUG_ON(ARRAY_SIZE(xe_tile_drv_err_to_str) !=
> >>>>> +		     XE_TILE_DRV_ERR_MAX);
> >>>>
> >>>> move to xe_tile_driver_error_to_string() instead
> >>>
> >>> Like said above, such API is not needed.
> >>
> >> so maybe at least make this xe_tile_drv_err_to_str[] static ?
> >> it doesn't look to be used outside this function
> >>
> >> Michal
> >>
> >>>
> >>> Thanks,
> >>> Tejas
> >>>>
> >>>>> +
> >>>>> +	drm_err(&tile->xe->drm, "TILE%u [%s] %pV",
> >>>>> +		tile->id, xe_tile_drv_err_to_str[err], &vaf);
> >>>>> +	va_end(args);
> >>>>>  }
> >>>>>
> >>>>>  /**
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_tile.h
> >>>>> b/drivers/gpu/drm/xe/xe_tile.h index 092a6b17a97e..1e23acdb9c81
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_tile.h
> >>>>> +++ b/drivers/gpu/drm/xe/xe_tile.h
> >>>>> @@ -14,7 +14,9 @@ int xe_tile_alloc(struct xe_tile *tile);  int
> >>>>> xe_tile_init_noalloc(struct xe_tile *tile);
> >>>>>
> >>>>>  void xe_tile_migrate_wait(struct xe_tile *tile);
> >>>>> +extern const char *const xe_tile_drv_err_to_str[];
> >>>>
> >>>> don't expose it, hide in xe_tile_driver_error_to_string() if needed
> >>>>
> >>>>>  void xe_tile_report_driver_error(struct xe_tile *tile,
> >>>>> -				 const enum xe_tile_drv_err_type err);
> >>>>> +				 const enum xe_tile_drv_err_type err,
> >>>>> +				 const char *fmt, ...);
> >>>>>
> >>>>>  #endif