[PATCH] drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from

Thu Mar 6 22:48:23 UTC 2025

On 06.03.2025 02:15, Daniele Ceraolo Spurio wrote:
> 
> 
> On 2/20/2025 7:14 PM, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> Most H2G messages are FAST_REQ which means no synchronous response is
>> expected. The messages are sent as fire-and-forget with no tracking.
>> However, errors can still be returned when something goes unexpectedly
>> wrong. That leads to confusion due to not being able to match up the
>> error response to the originating H2G.
>>
>> So add support for tracking the FAST_REQ H2Gs and matching up an error
>> response to its originator. This is only enabled in XE_DEBUG builds
>> given that such errors should never happen in a working system and
>> there is an overhead for the tracking.
>>
>> Further, if XE_DEBUG_GUC is enabled then even more memory and time is
>> used to record the call stack of each H2G and report that with an
>> error. That makes it much easier to work out where a specific H2G came
>> from if there are multiple code paths that can send it.
>>
>> Note, rather than create an extra Kconfig define for just this
>> feature, the XE_LARGE_GUC_BUFFER option has been re-used and renamed
>> to XE_DEBUG_GUC and is now just a general purpose 'verbose GuC debug'
>> option.
>>
>> Lastly, add a define to document FAST_REQ error 0x30C as being the
>> error most recently hit. Not sure why it was previously missing.
>>
>> Original-i915-code: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> ---
>>   drivers/gpu/drm/xe/Kconfig.debug        |  10 ++-
>>   drivers/gpu/drm/xe/abi/guc_errors_abi.h |   1 +
>>   drivers/gpu/drm/xe/xe_guc_ct.c          | 106 +++++++++++++++++++-----
>>   drivers/gpu/drm/xe/xe_guc_ct_types.h    |  15 ++++
>>   drivers/gpu/drm/xe/xe_guc_log.h         |   2 +-
>>   5 files changed, 111 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/Kconfig.debug b/drivers/gpu/drm/xe/
>> Kconfig.debug
>> index 0d749ed44878..ef2c456c3f2a 100644
>> --- a/drivers/gpu/drm/xe/Kconfig.debug
>> +++ b/drivers/gpu/drm/xe/Kconfig.debug
>> @@ -86,12 +86,16 @@ config DRM_XE_KUNIT_TEST
>>           If in doubt, say "N".
>>   -config DRM_XE_LARGE_GUC_BUFFER
>> -        bool "Enable larger guc log buffer"
>> +config DRM_XE_DEBUG_GUC
> 
> Do we need a maintainer ack for this rename?
> 
>> +        bool "Enable extra GuC related debug options"
>> +        depends on DRM_XE_DEBUG
>>           default n
>> +        select STACKDEPOT
>>           help
>>             Choose this option when debugging guc issues.
>> -          Buffer should be large enough for complex issues.
>> +          The GuC log buffer is increased to the maximum allowed,
>> which should
>> +          be large enough for complex issues. It also enables

s/It/This config

>> recording of the
>> +          stack when tracking FAST_REQ messages.
>>               Recommended for driver developers only.
>>   diff --git a/drivers/gpu/drm/xe/abi/guc_errors_abi.h b/drivers/gpu/
>> drm/xe/abi/guc_errors_abi.h
>> index 2c627a21648f..c25ea52a6e61 100644
>> --- a/drivers/gpu/drm/xe/abi/guc_errors_abi.h
>> +++ b/drivers/gpu/drm/xe/abi/guc_errors_abi.h
>> @@ -40,6 +40,7 @@ enum xe_guc_response_status {
>>       XE_GUC_RESPONSE_CTB_NOT_REGISTERED                  = 0x304,
>>       XE_GUC_RESPONSE_CTB_IN_USE                          = 0x305,
>>       XE_GUC_RESPONSE_CTB_INVALID_DESC                    = 0x306,
>> +    XE_GUC_RESPONSE_STATUS_HW_TIMEOUT                   = 0x30C,

we don't use/need this specific error code in this patch, so maybe its
definition deserves a separate patch?

>>       XE_GUC_RESPONSE_CTB_SOURCE_INVALID_DESCRIPTOR       = 0x30D,
>>       XE_GUC_RESPONSE_CTB_DESTINATION_INVALID_DESCRIPTOR  = 0x30E,
>>       XE_GUC_RESPONSE_INVALID_CONFIG_STATE                = 0x30F,
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/
>> xe_guc_ct.c
>> index 72ad576fc18e..2d59934b87dc 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -624,6 +624,43 @@ static void g2h_release_space(struct xe_guc_ct
>> *ct, u32 g2h_len)
>>       spin_unlock_irq(&ct->fast_lock);
>>   }
>>   +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)

do we really want to be left without any clue when FAST_REQ action fails
on the production build? without saving the stack, the cost of tracking
all fast-requests should be really negligible

>> +static void fast_req_track(struct xe_guc_ct *ct, u16 fence, u16 action)
>> +{
>> +    unsigned int slot = fence % ARRAY_SIZE(ct->fast_req);
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +    unsigned long entries[SZ_32];
>> +    unsigned int n;
>> +
>> +    n = stack_trace_save(entries, ARRAY_SIZE(entries), 1);
>> +
>> +    /* May be called under spinlock, so avoid sleeping */
>> +    ct->fast_req[slot].stack = stack_depot_save(entries, n, GFP_NOWAIT);
> 
> From the name it looks like this "save" should be matched by a "delete",
> but I can't find any docs explicitly stating how this should be used and

from DOC:

 * Does not increment the refcount on the saved stack trace; see
 * stack_depot_save_flags() for more details.

saved stack trace will be kept in the depot forever ...

> other examples (both in i915 and outside) seem to also be missing the
> delete, so I'm assuming this is the correct way to use this.

we only have few points from where we send FAST_REQ so it shouldn't be a
problem if traces will stay in the depot even after we remove the driver

> 
>> +#endif
>> +    ct->fast_req[slot].fence = fence;
>> +    ct->fast_req[slot].action = action;
>> +}
>> +#endif
>> +
>> +/*
>> + * The CT protocol accepts a 16 bits fence. This field is fully owned
>> by the
>> + * driver, the GuC will just copy it to the reply message. Since we
>> need to
>> + * be able to distinguish between replies to REQUEST and FAST_REQUEST
>> messages,
>> + * we use one bit of the seqno as an indicator for that and a rolling
>> counter
>> + * for the remaining 15 bits.
>> + */
>> +#define CT_SEQNO_MASK GENMASK(14, 0)
>> +#define CT_SEQNO_UNTRACKED BIT(15)
>> +static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>> +{
>> +    u32 seqno = ct->fence_seqno++ & CT_SEQNO_MASK;
>> +
>> +    if (!is_g2h_fence)
>> +        seqno |= CT_SEQNO_UNTRACKED;
>> +
>> +    return seqno;
>> +}
>> +
>>   #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header
>> and one DW HxG header */
>>     static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32
>> len,
>> @@ -715,6 +752,12 @@ static int h2g_write(struct xe_guc_ct *ct, const
>> u32 *action, u32 len,
>>       xe_map_memcpy_to(xe, &map, H2G_CT_HEADERS * sizeof(u32), action,
>> len * sizeof(u32));
>>       xe_device_wmb(xe);
>>   +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +    if (ct_fence_value & CT_SEQNO_UNTRACKED)
>> +        fast_req_track(ct, ct_fence_value,
>> +                   FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
>> +#endif
>> +
>>       /* Update local copies */
>>       h2g->info.tail = (tail + full_len) % h2g->info.size;
>>       h2g_reserve_space(ct, full_len);
>> @@ -732,25 +775,6 @@ static int h2g_write(struct xe_guc_ct *ct, const
>> u32 *action, u32 len,
>>       return -EPIPE;
>>   }
>>   -/*
>> - * The CT protocol accepts a 16 bits fence. This field is fully owned
>> by the
>> - * driver, the GuC will just copy it to the reply message. Since we
>> need to
>> - * be able to distinguish between replies to REQUEST and FAST_REQUEST
>> messages,
>> - * we use one bit of the seqno as an indicator for that and a rolling
>> counter
>> - * for the remaining 15 bits.
>> - */
>> -#define CT_SEQNO_MASK GENMASK(14, 0)
>> -#define CT_SEQNO_UNTRACKED BIT(15)
>> -static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>> -{
>> -    u32 seqno = ct->fence_seqno++ & CT_SEQNO_MASK;
>> -
>> -    if (!is_g2h_fence)
>> -        seqno |= CT_SEQNO_UNTRACKED;
>> -
>> -    return seqno;
>> -}
>> -
>>   static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32
>> *action,
>>                   u32 len, u32 g2h_len, u32 num_g2h,
>>                   struct g2h_fence *g2h_fence)
>> @@ -1141,6 +1165,47 @@ static int guc_crash_process_msg(struct
>> xe_guc_ct *ct, u32 action)
>>       return 0;
>>   }
>>   +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +static void fast_req_report(struct xe_guc_ct *ct, u16 fence)
>> +{
>> +    unsigned int n;
>> +    bool found = false;
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +    char *buf;
>> +#endif
>> +
>> +    lockdep_assert_held(&ct->lock);
>> +
>> +    for (n = 0; n < ARRAY_SIZE(ct->fast_req); n++) {
>> +        if (ct->fast_req[n].fence != fence)
>> +            continue;
>> +        found = true;
>> +
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +        buf = kmalloc(SZ_4K, GFP_NOWAIT);
>> +        if (buf && stack_depot_snprint(ct->fast_req[n].stack, buf,
>> SZ_4K, 0))
>> +            xe_gt_err(ct_to_gt(ct), "Fence 0x%x was used by action
>> %#04x sent at\n%s",
>> +                  fence, ct->fast_req[n].action, buf);
>> +        else
>> +            xe_gt_err(ct_to_gt(ct), "Fence 0x%x was used by action
>> %#04x [failed to retrieve stack]\n",
>> +                  fence, ct->fast_req[n].action);
>> +        kfree(buf);
>> +#else
>> +        xe_gt_err(ct_to_gt(ct), "Fence 0x%x was used by action %#04x\n",
>> +              fence, ct->fast_req[n].action);
>> +#endif
>> +        break;
>> +    }
>> +
>> +    if (!found)
>> +        xe_gt_warn(ct_to_gt(ct), "FAST_REQ G2H fence 0x%x not found!
>> \n", fence);
> 
> Not convinced about this error message. the fast_req array is only 32
> entries deep, so it wouldn't be weird for entries to be overwritten in a
> busy system, but the read I get from this message is that something is
> wrong with the fact that we didn't find the fence. Maybe go for
> something like: "FAST_REQ G2H fence 0x%x action unknown". Not a blocker.
> 
>> +}
>> +#else
>> +static void fast_req_report(struct xe_guc_ct *ct, u16 fence)
>> +{
>> +}
>> +#endif
> 
> nit: for fast_req_track() you only define the function under
> CONFIG_DRM_XE_DEBUG and then you conditionally call it based on the
> define, while here you define it for both cases and call it
> unconditionally. Not a blocker, it just seems weird to have different
> approaches.
> 
>> +
>>   static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>   {
>>       struct xe_gt *gt =  ct_to_gt(ct);
>> @@ -1169,6 +1234,9 @@ static int parse_g2h_response(struct xe_guc_ct
>> *ct, u32 *msg, u32 len)
>>           else
>>               xe_gt_err(gt, "unexpected response %u for FAST_REQ H2G
>> fence 0x%x!\n",
>>                     type, fence);

hmm, to some extend, all errors are unexpected, and with this new
tracking we can emit just single but more meaningful message:

// stack available
	"H2G fast request %#x failed: error %#x hint %#x sent from %s\n"

// stack not available
	"H2G fast request %#x failed: error %#x hint %#x\n"

// unknown fence
	"H2G fast request failed: fence %#x error %#x hint %#x\n"

>> +
>> +        fast_req_report(ct, fence);
>> +
>>           CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
>>             return -EPROTO;
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/
>> xe/xe_guc_ct_types.h
>> index 8e1b9d981d61..c6b89b757a76 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> @@ -9,6 +9,7 @@
>>   #include <linux/interrupt.h>
>>   #include <linux/iosys-map.h>
>>   #include <linux/spinlock_types.h>
>> +#include <linux/stackdepot.h>
>>   #include <linux/wait.h>
>>   #include <linux/xarray.h>
>>   @@ -104,6 +105,18 @@ struct xe_dead_ct {
>>       /** snapshot_log: copy of GuC log at point of error */
>>       struct xe_guc_log_snapshot *snapshot_log;
>>   };
>> +
>> +/** struct xe_fast_req_fence - Used to track FAST_REQ messages to
>> match error responses */
>> +struct xe_fast_req_fence {

nit: name is little misleading

struct xe_fast_req_info ?
struct xe_fast_req_tracker ?
struct xe_fast_req_record ?

>> +    /** @fence: sequence number sent in H2G and return in G2H error */
>> +    u16 fence;

rest of the code uses "seqno"

>> +    /** @action: H2G action code */
>> +    u16 action;
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +    /** @stack: call stack from when the H2G was sent */
>> +    depot_stack_handle_t stack;
>> +#endif
>> +};
> 
> nit: should this whole struct be wrapped in CONFIG_DRM_XE_DEBUG? Not
> sure if any code analyzer would be smart enough to mark it as unused if
> CONFIG_DRM_XE_DEBUG is not set.
> 
> Apart from the nits this looks good to me:
> 
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> 
> Daniele
> 
>>   #endif
>>     /**
>> @@ -152,6 +165,8 @@ struct xe_guc_ct {
>>   #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>>       /** @dead: information for debugging dead CTs */
>>       struct xe_dead_ct dead;
>> +    /** @fast_req: history of FAST_REQ messages for matching with G2H
>> error responses*/
>> +    struct xe_fast_req_fence fast_req[SZ_32];
>>   #endif
>>   };
>>   diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/
>> xe_guc_log.h
>> index 5b896f5fafaf..f1e2b0be90a9 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_log.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_log.h
>> @@ -12,7 +12,7 @@
>>   struct drm_printer;
>>   struct xe_device;
>>   -#if IS_ENABLED(CONFIG_DRM_XE_LARGE_GUC_BUFFER)
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>>   #define CRASH_BUFFER_SIZE       SZ_1M
>>   #define DEBUG_BUFFER_SIZE       SZ_8M
>>   #define CAPTURE_BUFFER_SIZE     SZ_2M
>