[PATCH v3 1/9] drm/xe/guc: Add register defines for GuC based register capture
Matt Roper
matthew.d.roper at intel.com
Mon Jan 22 21:39:56 UTC 2024
On Thu, Jan 18, 2024 at 04:41:55PM -0800, Zhanjun Dong wrote:
> Add registers defines and list of registers for GuC based error state capture.
>
> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
> ---
> drivers/gpu/drm/xe/Kconfig | 11 +++
> drivers/gpu/drm/xe/Makefile | 1 +
> drivers/gpu/drm/xe/regs/xe_engine_regs.h | 12 +++
> drivers/gpu/drm/xe/regs/xe_gt_regs.h | 20 +++++
> drivers/gpu/drm/xe/xe_guc.c | 5 ++
> drivers/gpu/drm/xe/xe_guc_capture.c | 108 +++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_guc_capture.h | 15 ++++
> 7 files changed, 172 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.c
> create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.h
>
> diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> index 1b57ae38210d..236763569877 100644
> --- a/drivers/gpu/drm/xe/Kconfig
> +++ b/drivers/gpu/drm/xe/Kconfig
> @@ -83,6 +83,17 @@ config DRM_XE_FORCE_PROBE
>
> Use "!*" to block the probe of the driver for all known devices.
>
> +config DRM_XE_CAPTURE_ERROR
> + bool "Enable capturing GPU state following a hang"
> + depends on DRM_XE
> + default y
> + help
> + This option enables capturing the GPU state when a hang is detected.
> + This information is vital for triaging hangs and assists in debugging.
> + Please report any hang to your Intel representative to help with triaging.
> +
> + If in doubt, say "Y".
> +
The commit message said that this was just adding register defines, but
you're actually adding new files and build options as well. That should
probably all happen as a separate patch.
> menu "drm/Xe Debugging"
> depends on DRM_XE
> depends on EXPERT
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index fe8b266a9819..6182f89a6bd5 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -92,6 +92,7 @@ xe-y += xe_bb.o \
> xe_gt_topology.o \
> xe_guc.o \
> xe_guc_ads.o \
> + xe_guc_capture.o \
> xe_guc_ct.o \
> xe_guc_db_mgr.o \
> xe_guc_debugfs.o \
> diff --git a/drivers/gpu/drm/xe/regs/xe_engine_regs.h b/drivers/gpu/drm/xe/regs/xe_engine_regs.h
> index 0b1266c88a6a..06015703a33e 100644
> --- a/drivers/gpu/drm/xe/regs/xe_engine_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_engine_regs.h
> @@ -64,10 +64,16 @@
>
> #define RING_ACTHD_UDW(base) XE_REG((base) + 0x5c)
> #define RING_DMA_FADD_UDW(base) XE_REG((base) + 0x60)
> +#define RING_IPEIR(base) XE_REG((base) + 0x64)
There's no such register on any platform supported by Xe; BDW (gen8) was
the last platform that had this.

As a reminder, i915 dumps a whole bunch of invalid registers and data
that don't actually apply to any modern platform. I think cleaning that
all up has been on the todo list for a long time. At least with Xe
we're starting fresh so we can make sure that we're dumping just the
registers that actually exist and are useful for debugging, and make
sure that we're dumping them accurately; we don't want to just blindly
copy/paste register dump stuff over from i915 since a lot of it is just
unwanted bitrot.
> #define RING_IPEHR(base) XE_REG((base) + 0x68)
> +#define RING_INSTDONE(base) XE_REG((base) + 0x6c)
> +#define RING_INSTPS(base) XE_REG((base) + 0x70)
> +
> #define RING_ACTHD(base) XE_REG((base) + 0x74)
> #define RING_DMA_FADD(base) XE_REG((base) + 0x78)
> #define RING_HWS_PGA(base) XE_REG((base) + 0x80)
> +#define IPEIR(base) XE_REG((base) + 0x88)
This looks like the same register as above, but at a different offset
that hasn't existed on any platform documented by the current bspec
tools. That means this is probably an old gen3 offset for something;
definitely not the kind of thing that we should be putting into Xe.

You also don't use this definition in the tables at the end of this
patch.
> +
> #define RING_HWSTAM(base) XE_REG((base) + 0x98)
> #define RING_MI_MODE(base) XE_REG((base) + 0x9c)
> #define RING_NOPID(base) XE_REG((base) + 0x94)
> @@ -111,9 +117,12 @@
> #define FF_DOP_CLOCK_GATE_DISABLE REG_BIT(1)
> #define REPLAY_MODE_GRANULARITY REG_BIT(0)
>
> +#define RING_BBSTATE(base) XE_REG((base) + 0x110)
> #define RING_BBADDR(base) XE_REG((base) + 0x140)
> #define RING_BBADDR_UDW(base) XE_REG((base) + 0x168)
>
> +#define CCID(base) XE_REG((base) + 0x180)
> +
> #define BCS_SWCTRL(base) XE_REG((base) + 0x200, XE_REG_OPTION_MASKED)
> #define BCS_SWCTRL_DISABLE_256B REG_BIT(2)
>
> @@ -129,6 +138,9 @@
> #define CTX_CTRL_INHIBIT_SYN_CTX_SWITCH REG_BIT(3)
> #define CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT REG_BIT(0)
>
> +#define RING_PDP_UDW(base, n) XE_REG((base) + 0x270 + (n) * 8 + 4)
> +#define RING_PDP_LDW(base, n) XE_REG((base) + 0x270 + (n) * 8)
What is the goal of dumping these? I don't think they're relevant for
modern usage, are they?
> +
> #define RING_MODE(base) XE_REG((base) + 0x29c)
> #define GFX_DISABLE_LEGACY_MODE REG_BIT(3)
>
> diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> index 0d4bfc35ff37..46e3395f57ef 100644
> --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> @@ -67,6 +67,8 @@
> #define VE1_AUX_INV XE_REG(0x42b8)
> #define AUX_INV REG_BIT(0)
>
> +#define AUX_ERR_DBG XE_REG(0x43f4)
> +
This is a multicast register. It also doesn't exist anymore on Xe2 and
beyond, but you have it on a "COMMON" list, which implies it would apply
to all platforms.

Dumping registers incorrectly (e.g., printing a single value for a
multicast register that actually has several different values) is more
misleading than not printing the register at all. That's why it's
important to make sure each register is being dumped accurately. We
also need to justify why we're including various registers; i915
included a bunch of garbage that nobody cared about (which was obvious
since we dumped incorrect values for some registers for years and nobody
noticed). We should keep the Xe list restricted to just the registers
that we and our userspace partners would actually find useful. Adding
unwanted registers just increases the maintenance burden and will lead
to Xe's error dump turning into the same kind of graveyard i915's
became.
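
To make the multicast point concrete, here is a minimal standalone
sketch (a toy model, not Xe code). Everything in it is hypothetical: the
mcr_reg_model struct, the four-instance layout, the made-up values, and
the assumption that an unsteered read returns instance 0. It only shows
how a single-value dump can hide instances that hold different values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model: one multicast register backed by four instances. */
#define NUM_INSTANCES 4

struct mcr_reg_model {
	const char *name;
	uint32_t value[NUM_INSTANCES];	/* each instance holds its own value */
};

/* Assumption: a plain, unsteered read lands on instance 0 only. */
static uint32_t read_unsteered(const struct mcr_reg_model *reg)
{
	return reg->value[0];
}

/* Steered read of one specific instance. */
static uint32_t read_instance(const struct mcr_reg_model *reg,
			      unsigned int instance)
{
	assert(instance < NUM_INSTANCES);
	return reg->value[instance];
}

/* Count the instances whose value a single unsteered dump would hide. */
static unsigned int count_hidden_values(const struct mcr_reg_model *reg)
{
	unsigned int hidden = 0;

	for (unsigned int i = 0; i < NUM_INSTANCES; i++)
		if (read_instance(reg, i) != read_unsteered(reg))
			hidden++;

	return hidden;
}

/* Dump one line per instance; returns the number of lines printed. */
static int dump_all_instances(const struct mcr_reg_model *reg, FILE *out)
{
	int lines = 0;

	for (unsigned int i = 0; i < NUM_INSTANCES; i++) {
		fprintf(out, "%s[%u]: 0x%08x\n", reg->name, i,
			(unsigned int)read_instance(reg, i));
		lines++;
	}

	return lines;
}
```

In this model, a dump built on read_unsteered() alone reports one value
even when count_hidden_values() is nonzero, which is the misleading case
described above; dump_all_instances() is the per-instance alternative.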
> #define XEHP_TILE_ADDR_RANGE(_idx) XE_REG_MCR(0x4900 + (_idx) * 4)
> #define XEHP_FLAT_CCS_BASE_ADDR XE_REG_MCR(0x4910)
>
> @@ -94,6 +96,8 @@
> #define FF_MODE2_TDS_TIMER_MASK REG_GENMASK(23, 16)
> #define FF_MODE2_TDS_TIMER_128 REG_FIELD_PREP(FF_MODE2_TDS_TIMER_MASK, 4)
>
> +#define XEHPG_INSTDONE_GEOM_SVG XE_REG_MCR(0x666c)
> +
> #define CACHE_MODE_1 XE_REG(0x7004, XE_REG_OPTION_MASKED)
> #define MSAA_OPTIMIZATION_REDUC_DISABLE REG_BIT(11)
>
> @@ -110,6 +114,10 @@
> #define FLSH_IGNORES_PSD REG_BIT(10)
> #define FD_END_COLLECT REG_BIT(5)
>
> +#define SC_INSTDONE XE_REG(0x7100)
> +#define SC_INSTDONE_EXTRA XE_REG(0x7104)
> +#define SC_INSTDONE_EXTRA2 XE_REG(0x7108)
These are multicast registers too.
> +
> #define COMMON_SLICE_CHICKEN4 XE_REG(0x7300, XE_REG_OPTION_MASKED)
> #define DISABLE_TDC_LOAD_BALANCING_CALC REG_BIT(6)
>
> @@ -299,6 +307,11 @@
>
> #define XE2LPM_L3SQCREG5 XE_REG_MCR(0xb658)
>
> +#define FAULT_TLB_DATA0 XE_REG(0xceb8)
> +#define FAULT_TLB_DATA1 XE_REG(0xcebc)
Also multicast.
> +
> +#define RING_FAULT_REG XE_REG(0xcec4)
Ditto.
> +
> #define XEHP_MERT_MOD_CTRL XE_REG_MCR(0xcf28)
> #define RENDER_MOD_CTRL XE_REG_MCR(0xcf2c)
> #define COMP_MOD_CTRL XE_REG_MCR(0xcf30)
> @@ -317,6 +330,11 @@
> #define INVALIDATION_BROADCAST_MODE_DIS REG_BIT(12)
> #define GLOBAL_INVALIDATION_MODE REG_BIT(2)
>
> +#define GAM_DONE XE_REG(0xcf68)
Ditto.
> +
> +#define SAMPLER_INSTDONE XE_REG_MCR(0xe160)
> +#define ROW_INSTDONE XE_REG_MCR(0xe164)
> +
> #define HALF_SLICE_CHICKEN5 XE_REG_MCR(0xe188, XE_REG_OPTION_MASKED)
> #define DISABLE_SAMPLE_G_PERFORMANCE REG_BIT(0)
>
> @@ -484,6 +502,8 @@
> #define GT_CS_MASTER_ERROR_INTERRUPT REG_BIT(3)
> #define GT_RENDER_USER_INTERRUPT REG_BIT(0)
>
> +#define SFC_DONE(n) XE_REG(0x1cc000 + (n) * 0x1000)
> +
> #define PVC_GT0_PACKAGE_ENERGY_STATUS XE_REG(0x281004)
> #define PVC_GT0_PACKAGE_RAPL_LIMIT XE_REG(0x281008)
> #define PVC_GT0_PACKAGE_POWER_SKU_UNIT XE_REG(0x281068)
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index 2891b0cc4f7f..63587db6a548 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -17,6 +17,7 @@
> #include "xe_force_wake.h"
> #include "xe_gt.h"
> #include "xe_guc_ads.h"
> +#include "xe_guc_capture.h"
> #include "xe_guc_ct.h"
> #include "xe_guc_hwconfig.h"
> #include "xe_guc_log.h"
> @@ -290,6 +291,10 @@ int xe_guc_init(struct xe_guc *guc)
> if (ret)
> goto out;
>
> + ret = xe_guc_capture_init(guc);
> + if (ret)
> + goto out;
> +
> ret = xe_guc_ads_init(&guc->ads);
> if (ret)
> goto out;
> diff --git a/drivers/gpu/drm/xe/xe_guc_capture.c b/drivers/gpu/drm/xe/xe_guc_capture.c
> new file mode 100644
> index 000000000000..cacd50f4718a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_capture.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2021-2022 Intel Corporation
> + */
> +
> +#include <linux/types.h>
> +
> +#include <drm/drm_print.h>
> +
> +#include "abi/guc_actions_abi.h"
> +#include "regs/xe_regs.h"
> +#include "regs/xe_engine_regs.h"
> +#include "regs/xe_gt_regs.h"
> +#include "regs/xe_guc_regs.h"
> +
> +#include "xe_bo.h"
> +#include "xe_device.h"
> +#include "xe_exec_queue_types.h"
> +#include "xe_hw_engine_types.h"
> +#include "xe_gt.h"
> +#include "xe_gt_printk.h"
> +#include "xe_guc.h"
> +#include "xe_guc_capture.h"
> +#include "xe_guc_ct.h"
> +
> +#include "xe_guc_log.h"
> +#include "xe_gt_mcr.h"
> +#include "xe_guc_submit.h"
> +#include "xe_macros.h"
> +#include "xe_map.h"
> +
> +#if IS_ENABLED(CONFIG_DRM_XE_CAPTURE_ERROR)
> +
> +/*
> + * Define all device tables of GuC error capture register lists
The tables below don't really make sense yet because they don't get used
anywhere yet and it isn't even clear what the structure is (i.e., the
"0, 0" part doesn't relate to anything else in this patch).

In general it would probably be better to approach this from the other
direction: add the general infrastructure to just dump the registers
we already have definitions for first, then follow up with extra patches
that add additional registers and include them in the appropriate lists,
along with an explanation for why each set of registers is useful to
dump.
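
As an illustration of what would make the table structure clear, here is
a minimal standalone sketch. It assumes the two zeroes are a flags word
and a mask; the patch itself does not say, so that is a guess. The
struct name, field names, and lookup helper are all hypothetical. The
three offsets are the engine-relative offsets of registers already
defined in xe_engine_regs.h, as shown in the hunks above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor; field names are assumptions, not from the patch. */
struct example_reg_descr {
	uint32_t offset;	/* offset from the engine MMIO base */
	uint32_t flags;		/* assumed meaning of the first "0" */
	uint32_t mask;		/* assumed meaning of the second "0" */
	const char *name;	/* label printed in the dump */
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Designated initializers make each column self-documenting. */
static const struct example_reg_descr example_engine_regs[] = {
	{ .offset = 0x68, .flags = 0, .mask = 0, .name = "IPEHR" },
	{ .offset = 0x74, .flags = 0, .mask = 0, .name = "ACTHD_LDW" },
	{ .offset = 0x5c, .flags = 0, .mask = 0, .name = "ACTHD_UDW" },
};

/* Look up a descriptor by its dump label; returns NULL if not listed. */
static const struct example_reg_descr *example_find_reg(const char *name)
{
	for (size_t i = 0; i < ARRAY_LEN(example_engine_regs); i++)
		if (strcmp(example_engine_regs[i].name, name) == 0)
			return &example_engine_regs[i];

	return NULL;
}
```

With the struct introduced alongside the tables, a reader can tell what
each field is for, and a later patch that adds a register only has to
add one line plus the justification for dumping it.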
Matt
> + * NOTE: For engine-registers, GuC only needs the register offsets
> + * from the engine-mmio-base
> + */
> +#define COMMON_XELP_BASE_GLOBAL \
> + { FORCEWAKE_GT, 0, 0, "FORCEWAKE" }, \
> + { FAULT_TLB_DATA0, 0, 0, "FAULT_TLB_DATA0" }, \
> + { FAULT_TLB_DATA1, 0, 0, "FAULT_TLB_DATA1" }, \
> + { AUX_ERR_DBG, 0, 0, "AUX_ERR_DBG" }, \
> + { GAM_DONE, 0, 0, "GAM_DONE" }, \
> + { RING_FAULT_REG, 0, 0, "FAULT_REG" }
> +
> +#define COMMON_BASE_ENGINE_INSTANCE \
> + { RING_PSMI_CTL(0), 0, 0, "RC PSMI" }, \
> + { RING_ESR(0), 0, 0, "ESR" }, \
> + { RING_EMR(0), 0, 0, "EMR" }, \
> + { RING_EIR(0), 0, 0, "EIR" }, \
> + { RING_EXECLIST_STATUS_HI(0), 0, 0, "RING_EXECLIST_STATUS_HI" }, \
> + { RING_EXECLIST_STATUS_LO(0), 0, 0, "RING_EXECLIST_STATUS_LO" }, \
> + { RING_DMA_FADD(0), 0, 0, "RING_DMA_FADD_LDW" }, \
> + { RING_DMA_FADD_UDW(0), 0, 0, "RING_DMA_FADD_UDW" }, \
> + { RING_IPEIR(0), 0, 0, "IPEIR" }, \
> + { RING_IPEHR(0), 0, 0, "IPEHR" }, \
> + { RING_INSTPS(0), 0, 0, "INSTPS" }, \
> + { RING_BBADDR(0), 0, 0, "RING_BBADDR_LOW32" }, \
> + { RING_BBADDR_UDW(0), 0, 0, "RING_BBADDR_UP32" }, \
> + { RING_BBSTATE(0), 0, 0, "BB_STATE" }, \
> + { CCID(0), 0, 0, "CCID" }, \
> + { RING_ACTHD(0), 0, 0, "ACTHD_LDW" }, \
> + { RING_ACTHD_UDW(0), 0, 0, "ACTHD_UDW" }, \
> + { INSTPM(0), 0, 0, "INSTPM" }, \
> + { RING_INSTDONE(0), 0, 0, "INSTDONE" }, \
> + { RING_NOPID(0), 0, 0, "RING_NOPID" }, \
> + { RING_START(0), 0, 0, "START" }, \
> + { RING_HEAD(0), 0, 0, "HEAD" }, \
> + { RING_TAIL(0), 0, 0, "TAIL" }, \
> + { RING_CTL(0), 0, 0, "CTL" }, \
> + { RING_MI_MODE(0), 0, 0, "MODE" }, \
> + { RING_CONTEXT_CONTROL(0), 0, 0, "RING_CONTEXT_CONTROL" }, \
> + { RING_HWS_PGA(0), 0, 0, "HWS" }, \
> + { RING_MODE(0), 0, 0, "GFX_MODE" }, \
> + { RING_PDP_LDW(0, 0), 0, 0, "PDP0_LDW" }, \
> + { RING_PDP_UDW(0, 0), 0, 0, "PDP0_UDW" }, \
> + { RING_PDP_LDW(0, 1), 0, 0, "PDP1_LDW" }, \
> + { RING_PDP_UDW(0, 1), 0, 0, "PDP1_UDW" }, \
> + { RING_PDP_LDW(0, 2), 0, 0, "PDP2_LDW" }, \
> + { RING_PDP_UDW(0, 2), 0, 0, "PDP2_UDW" }, \
> + { RING_PDP_LDW(0, 3), 0, 0, "PDP3_LDW" }, \
> + { RING_PDP_UDW(0, 3), 0, 0, "PDP3_UDW" }
> +
> +#define COMMON_XELP_BASE_RENDER \
> + { SC_INSTDONE, 0, 0, "SC_INSTDONE" }, \
> + { SC_INSTDONE_EXTRA, 0, 0, "SC_INSTDONE_EXTRA" }, \
> + { SC_INSTDONE_EXTRA2, 0, 0, "SC_INSTDONE_EXTRA2" }
> +
> +#define COMMON_XELP_BASE_VEC \
> + { SFC_DONE(0), 0, 0, "SFC_DONE[0]" }, \
> + { SFC_DONE(1), 0, 0, "SFC_DONE[1]" }, \
> + { SFC_DONE(2), 0, 0, "SFC_DONE[2]" }, \
> + { SFC_DONE(3), 0, 0, "SFC_DONE[3]" }
> +
> +int xe_guc_capture_init(struct xe_guc *guc)
> +{
> + return 0;
> +}
> +
> +#else /* IS_ENABLED(CONFIG_DRM_XE_CAPTURE_ERROR) */
> +
> +int xe_guc_capture_init(struct xe_guc *guc)
> +{
> + return 0;
> +}
> +
> +#endif /* IS_ENABLED(CONFIG_DRM_XE_CAPTURE_ERROR) */
> diff --git a/drivers/gpu/drm/xe/xe_guc_capture.h b/drivers/gpu/drm/xe/xe_guc_capture.h
> new file mode 100644
> index 000000000000..3caea2c6fffe
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_capture.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021-2021 Intel Corporation
> + */
> +
> +#ifndef _XE_GUC_CAPTURE_H
> +#define _XE_GUC_CAPTURE_H
> +
> +#include <linux/types.h>
> +
> +struct xe_guc;
> +
> +int xe_guc_capture_init(struct xe_guc *guc);
> +
> +#endif /* _XE_GUC_CAPTURE_H */
> --
> 2.34.1
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation