[Intel-xe] [PATCH v2 2/2] drm/xe: Add reg read/write trace

Sripada, Radhakrishna radhakrishna.sripada at intel.com
Thu Jun 22 18:50:14 UTC 2023


Hi Lucas/Matt,

> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of Lucas De Marchi
> Sent: Wednesday, June 21, 2023 3:58 PM
> To: Roper, Matthew D <matthew.d.roper at intel.com>
> Cc: intel-xe at lists.freedesktop.org
> Subject: Re: [Intel-xe] [PATCH v2 2/2] drm/xe: Add reg read/write trace
> 
> On Tue, Jun 20, 2023 at 07:10:43PM -0700, Matt Roper wrote:
> >On Tue, Jun 20, 2023 at 05:25:49PM -0700, Lucas De Marchi wrote:
> >> On Tue, Jun 20, 2023 at 10:38:50PM +0000, Radhakrishna Sripada wrote:
> >> >
> >> >
> >> > > -----Original Message-----
> >> > > From: De Marchi, Lucas <lucas.demarchi at intel.com>
> >> > > Sent: Tuesday, June 20, 2023 2:28 PM
> >> > > To: Sripada, Radhakrishna <radhakrishna.sripada at intel.com>
> >> > > Cc: intel-xe at lists.freedesktop.org; Roper, Matthew D
> >> > > <matthew.d.roper at intel.com>
> >> > > Subject: Re: [Intel-xe] [PATCH v2 2/2] drm/xe: Add reg read/write trace
> >> > >
> >> > > On Tue, Jun 20, 2023 at 12:32:29PM -0700, Radhakrishna Sripada wrote:
> >> > > >This will help debug register read/writes and provide
> >> > > >a way to trace all the mmio transactions.
> >> > > >
> >> > > >In order to avoid spam from xe_mmio_wait32,
> >> > > >xe_mmio_read32_notrace is introduced and used.
> >> > >
> >> > > feedback given before was to avoid the exported _notrace stuff. If there
> >> > > was a justification for it somewhere I probably lost it. Could you
> >> > > clarify? If I understand the above paragraph correctly, it's due to
> >> > > the fact the xe_mmio_wait32() reads the register multiple times...
> >> > > why would we want to avoid the "spam"? If we are reading it multiple
> >> > > times, the trace should show it instead of hiding.  If we demonstrate
> >> > > there is an actual impact, then xe_mmio_wait32() should be adjusted, not
> >> > > the rest of the shim layer in intel_uncore.h
> >> > Unless I missed something, the earlier feedback I received was on the
> >> > use of inline in the function definitions.
> >> > xe_mmio_wait32 is, I believe, loosely based on __intel_wait_for_register
> >> > (at least going by the compat headers), where the no-trace version of the
> >> > register read is used during the actual check and only the final value is
> >> > traced before returning from the function. A similar approach is adopted
> >> > here.
> >> >
> >> > If it does not add much value that way, then I believe we can use the
> >> > traceable version. But if developers want a stack trace for each trace
> >> > output, the spam becomes even messier.
> >>
> >> there was a chat last week about that... maybe I misremembered and you
> >> were not part of it.  The trace part should be cheap enough to keep
> >> everywhere; in particular, because it's not enabled by default, it's
> >> nothing more than a few nop instructions.
> >>
> >> If we don't trace everything and hide some of them behind functions like
> >> xe_mmio_wait32() or intel_uncore_read_fw(), with no indication in their
> >> name that they won't trace, then when we are reading the trace we spend
> >> a long time tracking back what exactly was happening.
> >> When we have the complete trace, filtering should be pretty
> >> easy. Eliminating random traces means we lose information that is not
> >> possible to recover after the fact.
> >
> >I could see an argument for avoiding tracepoint i915:i915_reg_rw within
> >xe_mmio_wait32() if that function utilized separate tracepoints like
> >i915:i915_reg_wait_start and i915:i915_reg_wait_end instead.  That could
> >potentially consolidate the repeated trace output down to just two lines
> >without losing any information.  We don't have tracepoints like those
> >today though.
> 
> you mean xe:xe_mmio_wait_start/xe:xe_mmio_wait_end, right?
> Yes, that would be good and the non-tracing functions would be local to
> xe_mmio rather than exported for everybody to shoot their own foot.

So, to summarize the discussion: we can reduce (and if possible eliminate)
the number of notrace functions being exported and introduce two more trace
points, xe:xe_mmio_wait_start and xe:xe_mmio_wait_end (which should be cheap
to implement), if we define the current xe_reg_rw as an event class and use
the notrace variant only inside xe_mmio_wait32.
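
For illustration, a rough sketch of what the event-class approach could look
like in xe_trace.h. The class/event names and the u32 addr field are my
assumptions (the current patch stores a struct xe_reg pointer in the entry,
which I would avoid since the pointer may no longer be valid when the trace
buffer is read):

DECLARE_EVENT_CLASS(xe_reg,
	    TP_PROTO(bool write, u32 addr, u64 val, int len),
	    TP_ARGS(write, addr, val, len),

	    TP_STRUCT__entry(
		    __field(u64, val)
		    __field(u32, addr)
		    __field(u16, write)
		    __field(u16, len)
		    ),

	    TP_fast_assign(
		   __entry->val = val;
		   __entry->addr = addr;
		   __entry->write = write;
		   __entry->len = len;
		),

	    TP_printk("%s reg=0x%x, len=%d, val=(0x%x, 0x%x)",
		      __entry->write ? "write" : "read",
		      __entry->addr, __entry->len,
		      lower_32_bits(__entry->val),
		      upper_32_bits(__entry->val))
);

/* The existing event keeps its name, now as a class instance. */
DEFINE_EVENT(xe_reg, xe_reg_rw,
	     TP_PROTO(bool write, u32 addr, u64 val, int len),
	     TP_ARGS(write, addr, val, len)
);

/* Emitted once when xe_mmio_wait32() starts polling a register. */
DEFINE_EVENT(xe_reg, xe_mmio_wait_start,
	     TP_PROTO(bool write, u32 addr, u64 val, int len),
	     TP_ARGS(write, addr, val, len)
);

/* Emitted once with the final value when the poll loop exits. */
DEFINE_EVENT(xe_reg, xe_mmio_wait_end,
	     TP_PROTO(bool write, u32 addr, u64 val, int len),
	     TP_ARGS(write, addr, val, len)
);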

However, the question still remains on the display side. There is an i915
patchset [1] that eliminates the nesting of trace events, and it relies on
the non-trace versions of the register reads/writes; the motive is to avoid
nested trace calls. Though in my recent tests the nesting did not cause any
problem, I think it is good to avoid it overall.

Can you suggest a path forward? IMO we should minimize exporting non-trace
functions solely for use under other trace points (and there is no way to
guarantee that those functions are not called at other times).
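
To make that concrete, a sketch of how xe_mmio_wait32() could use the two
wait events while keeping the notrace read file-local to xe_mmio.c, so
nothing notrace needs to be exported. The loop is based on what is visible
in the diff below; the timeout/backoff details are from my reading of the
current code and may not match exactly:

static u32 mmio_read32_no_trace(struct xe_gt *gt, struct xe_reg reg)
{
	struct xe_tile *tile = gt_to_tile(gt);

	if (reg.addr < gt->mmio.adj_limit)
		reg.addr += gt->mmio.adj_offset;

	return readl(tile->mmio.regs + reg.addr);
}

int xe_mmio_wait32(struct xe_gt *gt, struct xe_reg reg, u32 val, u32 mask,
		   u32 timeout_us, u32 *out_val, bool atomic)
{
	ktime_t cur = ktime_get_raw();
	const ktime_t end = ktime_add_us(cur, timeout_us);
	int ret = -ETIMEDOUT;
	s64 wait = 10;
	u32 read;

	/* One event marking the start of the poll instead of N reads. */
	trace_xe_mmio_wait_start(false, reg.addr, val, sizeof(val));

	for (;;) {
		read = mmio_read32_no_trace(gt, reg);
		if ((read & mask) == val) {
			ret = 0;
			break;
		}

		cur = ktime_get_raw();
		if (!ktime_before(cur, end))
			break;

		if (ktime_after(ktime_add_us(cur, wait), end))
			wait = ktime_us_delta(end, cur);

		if (atomic)
			udelay(wait);
		else
			usleep_range(wait, wait << 1);
		wait <<= 1;
	}

	/* One closing event carrying the final value observed. */
	trace_xe_mmio_wait_end(false, reg.addr, read, sizeof(read));

	if (out_val)
		*out_val = read;

	return ret;
}

With that, the compat header could keep calling the plain traced
xe_mmio_read32()/xe_mmio_write32() everywhere.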

Thoughts?

Radhakrishna(RK) Sripada
> 
> Lucas De Marchi
> 
> >
> >
> >Matt
> >
> >>
> >> >
> >> > >
> >> > > >
> >> > > >v2: Do not use inline for the function declarations.(MattR)
> >> > > >
> >> > > >Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> >> > > >Cc: Matt Roper <matthew.d.roper at intel.com>
> >> > > >Signed-off-by: Radhakrishna Sripada <radhakrishna.sripada at intel.com>
> >> > > >---
> >> > > > .../drm/xe/compat-i915-headers/intel_uncore.h |  8 +-
> >> > > > drivers/gpu/drm/xe/xe_mmio.c                  | 80 ++++++++++++++++++-
> >> > > > drivers/gpu/drm/xe/xe_mmio.h                  |  5 ++
> >> > > > drivers/gpu/drm/xe/xe_trace.h                 | 26 ++++++
> >> > > > 4 files changed, 111 insertions(+), 8 deletions(-)
> >> > > >
> >> > > >diff --git a/drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h
> >> > > b/drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h
> >> > > >index fae6213d26f1..036f3b963a84 100644
> >> > > >--- a/drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h
> >> > > >+++ b/drivers/gpu/drm/xe/compat-i915-headers/intel_uncore.h
> >> > > >@@ -113,7 +113,7 @@ static inline u32 intel_uncore_read_fw(struct fake_uncore *uncore,
> >> > > > {
> >> > > > 	struct xe_reg reg = XE_REG(i915_mmio_reg_offset(i915_reg));
> >> > > >
> >> > > >-	return xe_mmio_read32(__fake_uncore_to_gt(uncore), reg);
> >> > > >+	return xe_mmio_read32_notrace(__fake_uncore_to_gt(uncore), reg);
> >> > >
> >> > > how would you even know that intel_uncore_read_fw() is non-traceable?
> >> > > ditto for all the other functions in this header.
> >> > In i915, intel_uncore_read_fw and the other _fw functions translate to
> >> > raw_uncore_read and its variants, which do not include the GEN6
> >> > header/footer that contains the traces, and hence are the non-trace
> >> > variants. Unless I am missing something.
> >>
> >> I meant "as a user of the API". There's no clue whatsoever in
> >> intel_uncore_read_fw() that it will not trace. I think the selective
> >> trace/no-trace in random places in i915 was a mistake.
> >>
> >> Lucas De Marchi
> >>
> >> >
> >> > --Radhakrishna(RK) Sripada
> >> > >
> >> > >
> >> > > Lucas De Marchi
> >> > >
> >> > > > }
> >> > > >
> >> > > > static inline void intel_uncore_write_fw(struct fake_uncore *uncore,
> >> > > >@@ -121,7 +121,7 @@ static inline void intel_uncore_write_fw(struct fake_uncore *uncore,
> >> > > > {
> >> > > > 	struct xe_reg reg = XE_REG(i915_mmio_reg_offset(i915_reg));
> >> > > >
> >> > > >-	xe_mmio_write32(__fake_uncore_to_gt(uncore), reg, val);
> >> > > >+	xe_mmio_write32_notrace(__fake_uncore_to_gt(uncore), reg, val);
> >> > > > }
> >> > > >
> >> > > > static inline u32 intel_uncore_read_notrace(struct fake_uncore *uncore,
> >> > > >@@ -129,7 +129,7 @@ static inline u32 intel_uncore_read_notrace(struct fake_uncore *uncore,
> >> > > > {
> >> > > > 	struct xe_reg reg = XE_REG(i915_mmio_reg_offset(i915_reg));
> >> > > >
> >> > > >-	return xe_mmio_read32(__fake_uncore_to_gt(uncore), reg);
> >> > > >+	return xe_mmio_read32_notrace(__fake_uncore_to_gt(uncore), reg);
> >> > > > }
> >> > > >
> >> > > > static inline void intel_uncore_write_notrace(struct fake_uncore *uncore,
> >> > > >@@ -137,7 +137,7 @@ static inline void intel_uncore_write_notrace(struct fake_uncore *uncore,
> >> > > > {
> >> > > > 	struct xe_reg reg = XE_REG(i915_mmio_reg_offset(i915_reg));
> >> > > >
> >> > > >-	xe_mmio_write32(__fake_uncore_to_gt(uncore), reg, val);
> >> > > >+	xe_mmio_write32_notrace(__fake_uncore_to_gt(uncore), reg, val);
> >> > > > }
> >> > > >
> >> > > > #endif /* __INTEL_UNCORE_H__ */
> >> > > >diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> >> > > >index 78c7f839f783..ef2192a2d060 100644
> >> > > >--- a/drivers/gpu/drm/xe/xe_mmio.c
> >> > > >+++ b/drivers/gpu/drm/xe/xe_mmio.c
> >> > > >@@ -19,6 +19,7 @@
> >> > > > #include "xe_gt_mcr.h"
> >> > > > #include "xe_macros.h"
> >> > > > #include "xe_module.h"
> >> > > >+#include "xe_trace.h"
> >> > > >
> >> > > > #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> >> > > > #define TILE_COUNT		REG_GENMASK(15, 8)
> >> > > >@@ -429,7 +430,7 @@ static const struct xe_reg mmio_read_whitelist[] = {
> >> > > > 	RING_TIMESTAMP(RENDER_RING_BASE),
> >> > > > };
> >> > > >
> >> > > >-u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+u8 xe_mmio_read8_notrace(struct xe_gt *gt, struct xe_reg reg)
> >> > > > {
> >> > > > 	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >
> >> > > >@@ -439,6 +440,31 @@ u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg)
> >> > > > 	return readb(tile->mmio.regs + reg.addr);
> >> > > > }
> >> > > >
> >> > > >+u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+{
> >> > > >+	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >+	u8 val;
> >> > > >+
> >> > > >+	if (reg.addr < gt->mmio.adj_limit)
> >> > > >+		reg.addr += gt->mmio.adj_offset;
> >> > > >+
> >> > > >+	val = readb(tile->mmio.regs + reg.addr);
> >> > > >+
> >> > > >+	trace_xe_reg_rw(false, &reg, val, sizeof(val));
> >> > > >+
> >> > > >+	return val;
> >> > > >+}
> >> > > >+
> >> > > >+void xe_mmio_write32_notrace(struct xe_gt *gt, struct xe_reg reg, u32 val)
> >> > > >+{
> >> > > >+	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >+
> >> > > >+	if (reg.addr < gt->mmio.adj_limit)
> >> > > >+		reg.addr += gt->mmio.adj_offset;
> >> > > >+
> >> > > >+	writel(val, tile->mmio.regs + reg.addr);
> >> > > >+}
> >> > > >+
> >> > > > void xe_mmio_write32(struct xe_gt *gt, struct xe_reg reg, u32 val)
> >> > > > {
> >> > > > 	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >@@ -446,10 +472,12 @@ void xe_mmio_write32(struct xe_gt *gt, struct xe_reg reg, u32 val)
> >> > > > 	if (reg.addr < gt->mmio.adj_limit)
> >> > > > 		reg.addr += gt->mmio.adj_offset;
> >> > > >
> >> > > >+	trace_xe_reg_rw(true, &reg, val, sizeof(val));
> >> > > >+
> >> > > > 	writel(val, tile->mmio.regs + reg.addr);
> >> > > > }
> >> > > >
> >> > > >-u32 xe_mmio_read32(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+u32 xe_mmio_read32_notrace(struct xe_gt *gt, struct xe_reg reg)
> >> > > > {
> >> > > > 	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >
> >> > > >@@ -459,6 +487,21 @@ u32 xe_mmio_read32(struct xe_gt *gt, struct xe_reg reg)
> >> > > > 	return readl(tile->mmio.regs + reg.addr);
> >> > > > }
> >> > > >
> >> > > >+u32 xe_mmio_read32(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+{
> >> > > >+	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >+	u32 val;
> >> > > >+
> >> > > >+	if (reg.addr < gt->mmio.adj_limit)
> >> > > >+		reg.addr += gt->mmio.adj_offset;
> >> > > >+
> >> > > >+	val = readl(tile->mmio.regs + reg.addr);
> >> > > >+
> >> > > >+	trace_xe_reg_rw(false, &reg, val, sizeof(val));
> >> > > >+
> >> > > >+	return val;
> >> > > >+}
> >> > > >+
> >> > > > u32 xe_mmio_rmw32(struct xe_gt *gt, struct xe_reg reg, u32 clr, u32 set)
> >> > > > {
> >> > > > 	u32 old, reg_val;
> >> > > >@@ -470,6 +513,16 @@ u32 xe_mmio_rmw32(struct xe_gt *gt, struct xe_reg reg, u32 clr, u32 set)
> >> > > > 	return old;
> >> > > > }
> >> > > >
> >> > > >+void xe_mmio_write64_notrace(struct xe_gt *gt, struct xe_reg reg, u64 val)
> >> > > >+{
> >> > > >+	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >+
> >> > > >+	if (reg.addr < gt->mmio.adj_limit)
> >> > > >+		reg.addr += gt->mmio.adj_offset;
> >> > > >+
> >> > > >+	writeq(val, tile->mmio.regs + reg.addr);
> >> > > >+}
> >> > > >+
> >> > > > void xe_mmio_write64(struct xe_gt *gt, struct xe_reg reg, u64 val)
> >> > > > {
> >> > > > 	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >@@ -477,10 +530,12 @@ void xe_mmio_write64(struct xe_gt *gt, struct xe_reg reg, u64 val)
> >> > > > 	if (reg.addr < gt->mmio.adj_limit)
> >> > > > 		reg.addr += gt->mmio.adj_offset;
> >> > > >
> >> > > >+	trace_xe_reg_rw(true, &reg, val, sizeof(val));
> >> > > >+
> >> > > > 	writeq(val, tile->mmio.regs + reg.addr);
> >> > > > }
> >> > > >
> >> > > >-u64 xe_mmio_read64(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+u64 xe_mmio_read64_notrace(struct xe_gt *gt, struct xe_reg reg)
> >> > > > {
> >> > > > 	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >
> >> > > >@@ -490,6 +545,21 @@ u64 xe_mmio_read64(struct xe_gt *gt, struct xe_reg reg)
> >> > > > 	return readq(tile->mmio.regs + reg.addr);
> >> > > > }
> >> > > >
> >> > > >+u64 xe_mmio_read64(struct xe_gt *gt, struct xe_reg reg)
> >> > > >+{
> >> > > >+	struct xe_tile *tile = gt_to_tile(gt);
> >> > > >+	u64 val;
> >> > > >+
> >> > > >+	if (reg.addr < gt->mmio.adj_limit)
> >> > > >+		reg.addr += gt->mmio.adj_offset;
> >> > > >+
> >> > > >+	val = readq(tile->mmio.regs + reg.addr);
> >> > > >+
> >> > > >+	trace_xe_reg_rw(false, &reg, val, sizeof(val));
> >> > > >+
> >> > > >+	return val;
> >> > > >+}
> >> > > >+
> >> > > > int xe_mmio_write32_and_verify(struct xe_gt *gt, struct xe_reg reg, u32 val,
> >> > > > 			       u32 mask, u32 eval)
> >> > > > {
> >> > > >@@ -511,7 +581,7 @@ int xe_mmio_wait32(struct xe_gt *gt, struct xe_reg reg, u32 val, u32 mask,
> >> > > > 	u32 read;
> >> > > >
> >> > > > 	for (;;) {
> >> > > >-		read = xe_mmio_read32(gt, reg);
> >> > > >+		read = xe_mmio_read32_notrace(gt, reg);
> >> > > > 		if ((read & mask) == val) {
> >> > > > 			ret = 0;
> >> > > > 			break;
> >> > > >@@ -531,6 +601,8 @@ int xe_mmio_wait32(struct xe_gt *gt, struct xe_reg reg, u32 val, u32 mask,
> >> > > > 		wait <<= 1;
> >> > > > 	}
> >> > > >
> >> > > >+	trace_xe_reg_rw(false, &reg, read, sizeof(read));
> >> > > >+
> >> > > > 	if (out_val)
> >> > > > 		*out_val = read;
> >> > > >
> >> > > >diff --git a/drivers/gpu/drm/xe/xe_mmio.h b/drivers/gpu/drm/xe/xe_mmio.h
> >> > > >index c06fcac2cc5b..02364ba9dda2 100644
> >> > > >--- a/drivers/gpu/drm/xe/xe_mmio.h
> >> > > >+++ b/drivers/gpu/drm/xe/xe_mmio.h
> >> > > >@@ -20,11 +20,16 @@ struct xe_device;
> >> > > > #define GEN12_LMEM_BAR		2
> >> > > >
> >> > > > int xe_mmio_init(struct xe_device *xe);
> >> > > >+u8 xe_mmio_read8_notrace(struct xe_gt *gt, struct xe_reg reg);
> >> > > > u8 xe_mmio_read8(struct xe_gt *gt, struct xe_reg reg);
> >> > > >+void xe_mmio_write32_notrace(struct xe_gt *gt, struct xe_reg reg, u32 val);
> >> > > > void xe_mmio_write32(struct xe_gt *gt, struct xe_reg reg, u32 val);
> >> > > >+u32 xe_mmio_read32_notrace(struct xe_gt *gt, struct xe_reg reg);
> >> > > > u32 xe_mmio_read32(struct xe_gt *gt, struct xe_reg reg);
> >> > > > u32 xe_mmio_rmw32(struct xe_gt *gt, struct xe_reg reg, u32 clr, u32 set);
> >> > > >+void xe_mmio_write64_notrace(struct xe_gt *gt, struct xe_reg reg, u64 val);
> >> > > > void xe_mmio_write64(struct xe_gt *gt, struct xe_reg reg, u64 val);
> >> > > >+u64 xe_mmio_read64_notrace(struct xe_gt *gt, struct xe_reg reg);
> >> > > > u64 xe_mmio_read64(struct xe_gt *gt, struct xe_reg reg);
> >> > > > int xe_mmio_write32_and_verify(struct xe_gt *gt, struct xe_reg reg, u32 val,
> >> > > > 			       u32 mask, u32 eval);
> >> > > >diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
> >> > > >index 2f8eb7ebe9a7..a5d514c190a2 100644
> >> > > >--- a/drivers/gpu/drm/xe/xe_trace.h
> >> > > >+++ b/drivers/gpu/drm/xe/xe_trace.h
> >> > > >@@ -12,6 +12,7 @@
> >> > > > #include <linux/tracepoint.h>
> >> > > > #include <linux/types.h>
> >> > > >
> >> > > >+#include "regs/xe_reg_defs.h"
> >> > > > #include "xe_bo_types.h"
> >> > > > #include "xe_engine_types.h"
> >> > > > #include "xe_gt_tlb_invalidation_types.h"
> >> > > >@@ -507,6 +508,31 @@ DEFINE_EVENT(xe_vm, xe_vm_rebind_worker_exit,
> >> > > > 	     TP_ARGS(vm)
> >> > > > );
> >> > > >
> >> > > >+TRACE_EVENT(xe_reg_rw,
> >> > > >+	    TP_PROTO(bool write, struct xe_reg *reg, u64 val, int len),
> >> > > >+	    TP_ARGS(write, reg, val, len),
> >> > > >+
> >> > > >+	    TP_STRUCT__entry(
> >> > > >+		    __field(u64, val)
> >> > > >+		    __field(struct xe_reg *, reg)
> >> > > >+		    __field(u16, write)
> >> > > >+		    __field(u16, len)
> >> > > >+		    ),
> >> > > >+
> >> > > >+	    TP_fast_assign(
> >> > > >+		   __entry->val = (u64)val;
> >> > > >+		   __entry->reg = reg;
> >> > > >+		   __entry->write = write;
> >> > > >+		   __entry->len = len;
> >> > > >+		),
> >> > > >+
> >> > > >+	    TP_printk("%s reg=0x%x, len=%d, val=(0x%x, 0x%x)",
> >> > > >+		      __entry->write ? "write" : "read",
> >> > > >+		      __entry->reg->addr, __entry->len,
> >> > > >+		      (u32)(__entry->val & 0xffffffff),
> >> > > >+		      (u32)(__entry->val >> 32))
> >> > > >+);
> >> > > >+
> >> > > > TRACE_EVENT(xe_guc_ct_h2g_flow_control,
> >> > > > 	    TP_PROTO(u32 _head, u32 _tail, u32 size, u32 space, u32 len),
> >> > > > 	    TP_ARGS(_head, _tail, size, space, len),
> >> > > >--
> >> > > >2.34.1
> >> > > >
> >
> >--
> >Matt Roper
> >Graphics Software Engineer
> >Linux GPU Platform Enablement
> >Intel Corporation

