[PATCH v7 07/10] drm/xe/guc: Dead CT helper

Wed Sep 11 00:09:34 UTC 2024

On 9/5/2024 13:51, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> Add a worker function helper for asynchronously dumping state when an
> internal/fatal error is detected in CT processing. Being asynchronous
> is required to avoid deadlocks and scheduling-while-atomic or
> process-stalled-for-too-long issues. Also check for a bunch more error
> conditions and improve the handling of some existing checks.
>
> v2: Use compile time CONFIG check for new (but not directly CT_DEAD
> related) checks and use unsigned int for a bitmask, rename
> CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments,
> rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W.
> Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the
> entire mask to zero.
> v3: Fix kerneldoc
> v4: Nullify some floating pointers after free.
> v5: Add section headings and device info to make the state dump look
> more like a devcoredump to allow parsing by the same tools (eventual
> aim is to just call the devcoredump code itself, but that currently
> requires an xe_sched_job, which is not available in the CT code).
>
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> ---
>   .../drm/xe/abi/guc_communication_ctb_abi.h    |   1 +
>   drivers/gpu/drm/xe/xe_guc.c                   |   2 +-
>   drivers/gpu/drm/xe/xe_guc_ct.c                | 280 ++++++++++++++++--
>   drivers/gpu/drm/xe/xe_guc_ct.h                |   2 +-
>   drivers/gpu/drm/xe/xe_guc_ct_types.h          |  23 ++
>   5 files changed, 280 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> index 8f86a16dc577..f58198cf2cf6 100644
> --- a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> @@ -52,6 +52,7 @@ struct guc_ct_buffer_desc {
>   #define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
>   #define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
>   #define GUC_CTB_STATUS_MISMATCH				(1 << 2)
> +#define GUC_CTB_STATUS_DISABLED				(1 << 3)
>   	u32 reserved[13];
>   } __packed;
>   static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index 34cdb08b6e27..3fef24c965c4 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -1176,7 +1176,7 @@ void xe_guc_print_info(struct xe_guc *guc, struct drm_printer *p)
>   
>   	xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>   
> -	xe_guc_ct_print(&guc->ct, p, false);
> +	xe_guc_ct_print(&guc->ct, p);
>   	xe_guc_submit_print(guc, p);
>   }
>   
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index a63fe0a9077a..e31b1f0b855f 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -25,12 +25,57 @@
>   #include "xe_gt_sriov_pf_monitor.h"
>   #include "xe_gt_tlb_invalidation.h"
>   #include "xe_guc.h"
> +#include "xe_guc_log.h"
>   #include "xe_guc_relay.h"
>   #include "xe_guc_submit.h"
>   #include "xe_map.h"
>   #include "xe_pm.h"
>   #include "xe_trace_guc.h"
>   
> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> +enum {
> +	CT_DEAD_REARM,				/* 0x0001 - not an error condition */
> +	CT_DEAD_SETUP,				/* 0x0002 */
> +	CT_DEAD_H2G_WRITE,			/* 0x0004 */
> +	CT_DEAD_H2G_HAS_ROOM,			/* 0x0008 */
> +	CT_DEAD_G2H_READ,			/* 0x0010 */
> +	CT_DEAD_G2H_RECV,			/* 0x0020 */
> +	CT_DEAD_G2H_RELEASE,			/* 0x0040 */
> +	CT_DEAD_DEADLOCK,			/* 0x0080 */
> +	CT_DEAD_PROCESS_FAILED,			/* 0x0100 */
> +	CT_DEAD_FAST_G2H,			/* 0x0200 */
> +	CT_DEAD_PARSE_G2H_RESPONSE,		/* 0x0400 */
> +	CT_DEAD_PARSE_G2H_UNKNOWN,		/* 0x0800 */
> +	CT_DEAD_PARSE_G2H_ORIGIN,		/* 0x1000 */
> +	CT_DEAD_PARSE_G2H_TYPE,			/* 0x2000 */
> +};
> +
> +static void ct_dead_worker_func(struct work_struct *w);
> +
> +#define CT_DEAD(ct, ctb, reason_code)								\
> +	do {											\
> +		struct guc_ctb *_ctb = (ctb);							\
> +		if (_ctb)									\
> +			_ctb->info.broken = true;						\
> +		if (!(ct)->dead.reported) {							\
> +			struct xe_guc *guc = ct_to_guc(ct);					\
> +			spin_lock_irq(&ct->dead.lock);						\
This needs to be spin_lock_irqsave because CT_DEAD can be called inside 
an ISR. Without the save/restore, it will explicitly enable interrupts 
again and, as seen in the CI failure, that causes a WARNING:
     irq 210 handler dg1_irq_handler+0x0/0x240 [xe] enabled interrupts

John.