[Intel-xe] [PATCH] drm/xe/guc: Use FAST_REQUEST for non-blocking H2G messages
Daniele Ceraolo Spurio
daniele.ceraolospurio at intel.com
Thu Nov 16 21:40:07 UTC 2023
On 11/16/2023 9:12 AM, Michal Wajdeczko wrote:
>
> On 16.11.2023 00:40, Daniele Ceraolo Spurio wrote:
>> We're currently sending non-blocking H2G messages using the EVENT type,
>> which suppresses all CTB protocol replies from the GuC, including the
>> failure cases. This might cause errors to slip through and manifest as
>> unexpected behavior (e.g. a context state might not be what the driver
>> thinks it is because the state change command was silently rejected by
>> the GuC). To avoid this kind of problems, we can use the FAST_REQUEST
>> type instead, which suppresses the reply only on success; this way we
>> still get the advantage of not having to wait for an ack from the GuC
>> (i.e. the H2G is still non-blocking) while still detecting errors.
>> Since we don't track the non-blocking messages, we don't exactly know
>> which one of them failed and therefore can't attempt to recover and need
>> to escalate to GT reset instead.
>>
>> Note that FAST_REQUEST failures are NOT expected and are usually a sign
>> that the H2G was either malformed or requested an illegal operation.
>>
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>> Cc: John Harrison <John.C.Harrison at Intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> ---
>>
>> In the testing I did locally I didn't see any failure responses, but I
>> didn't run the full CI suite. If failure responses do come out from the
>> CI runs, I'll try to fix them before merging this patch.
>>
>> drivers/gpu/drm/xe/abi/guc_messages_abi.h | 1 +
>> drivers/gpu/drm/xe/xe_guc_ct.c | 28 +++++++++++++++++++++--
>> drivers/gpu/drm/xe/xe_guc_ct_types.h | 2 +-
>> 3 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/abi/guc_messages_abi.h b/drivers/gpu/drm/xe/abi/guc_messages_abi.h
>> index 3d199016cf88..e517e2a740d2 100644
>> --- a/drivers/gpu/drm/xe/abi/guc_messages_abi.h
>> +++ b/drivers/gpu/drm/xe/abi/guc_messages_abi.h
>> @@ -46,6 +46,7 @@
>> #define GUC_HXG_MSG_0_TYPE (0x7 << 28)
>> #define GUC_HXG_TYPE_REQUEST 0u
>> #define GUC_HXG_TYPE_EVENT 1u
>> +#define GUC_HXG_TYPE_FAST_REQUEST 2u
> you should also update doc above (bits 30:28)
ok
>
>> #define GUC_HXG_TYPE_NO_RESPONSE_BUSY 3u
>> #define GUC_HXG_TYPE_NO_RESPONSE_RETRY 5u
>> #define GUC_HXG_TYPE_RESPONSE_FAILURE 6u
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index a84e111bb36a..81d727b72709 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -448,7 +448,7 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>> GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
>> } else {
>> cmd[1] =
>> - FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_EVENT) |
>> + FIELD_PREP(GUC_HXG_MSG_0_TYPE, GUC_HXG_TYPE_FAST_REQUEST) |
>> FIELD_PREP(GUC_HXG_EVENT_MSG_0_ACTION |
>> GUC_HXG_EVENT_MSG_0_DATA0, action[0]);
>> }
>> @@ -505,7 +505,11 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>> if (g2h_fence_needs_alloc(g2h_fence)) {
>> void *ptr;
>>
>> - g2h_fence->seqno = (ct->fence_seqno++ & 0xffff);
>> + /* we reserve fence 0 to indicate unset */
> why ? it's not required by the HXG protocol
This was because the EVENT/FAST_REQUEST submission always have fence 0,
so I wanted to make sure we didn't use that for normal REQUEST as well
>
>> + g2h_fence->seqno = ct->fence_seqno++;
>> + if (unlikely(!g2h_fence->seqno))
>> + g2h_fence->seqno = ct->fence_seqno++;
>> +
>> ptr = xa_store(&ct->fence_lookup,
>> g2h_fence->seqno,
>> g2h_fence, GFP_ATOMIC);
>> @@ -792,6 +796,26 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>> u32 type = FIELD_GET(GUC_HXG_MSG_0_TYPE, msg[1]);
>> struct g2h_fence *g2h_fence;
>>
>> + /*
>> + * FAST_REQUEST messages don't have a fence and therefore are
>> + * not tracked in ct->fence_lookup. Note that those messages should
> but why we don't use fences for FAST_REQUEST ?
This is not a change in this patch, fence is currently unset for
non-blocking H2Gs.
>
> even if we don't expect any failures, from the protocol POV this is
> another separate flow, and we shouldn't reuse same 0 (unset) fence for
> multiple requests sent one by one.
>
> it also will make any debugging much harder (if we get such failure
> response) to figure out and match that error with actual request
> (without deep analysis of GuC log, that might be lost)
I can assign a fence to FAST_REQUEST, but I'll still need a way to
distinguish between normal requests and fast ones here. I can't just
assume that a request was a fast one if it isn't in ct->fence_lookup,
because as the comment below says the fence might've been dropped from
the xarray due to timeout. I'll try reserving a bit in the seqno to mark
it as belonging to a fast request, since the H2G protocol doesn't care
if the seqnos are not continuous.
>
>> + * never fail, so if we do get an error back it means we're likely
>> + * doing an illegal operation and the GuC is rejecting it. Since we
>> + * don't know which action failed and therefore can't attempt to
>> + * recover, we need to escalate the failure to trigger a reset.
>> + */
>> + if (!fence) {
>> + if (type == GUC_HXG_TYPE_RESPONSE_FAILURE)
>> + drm_err(&xe->drm, "FAST_REQUEST H2G failed! e=0x%x, h=%u\n",
>> + FIELD_GET(GUC_HXG_FAILURE_MSG_0_ERROR, msg[1]),
>> + FIELD_GET(GUC_HXG_FAILURE_MSG_0_HINT, msg[1]));
>> + else
>> + drm_err(&xe->drm, "unexpected response %u for FAST_REQUEST H2G!\n",
>> + type);
> we should at least use xe_gt_err() to get the GT id in the log
ok
Daniele
>
>> +
>> + return -EPROTO;
>> + }
>> +
>> lockdep_assert_held(&ct->lock);
>>
>> g2h_fence = xa_erase(&ct->fence_lookup, fence);
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> index d814d4ee3fc6..62f98dcaeaf1 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> @@ -99,7 +99,7 @@ struct xe_guc_ct {
>> /** @enabled: CT enabled */
>> bool enabled;
>> /** @fence_seqno: G2H fence seqno - 16 bits used by CT */
>> - u32 fence_seqno;
>> + u16 fence_seqno;
>> /** @fence_lookup: G2H fence lookup */
>> struct xarray fence_lookup;
>> /** @wq: wait queue used for reliable CT sends and freeing G2H credits */
More information about the Intel-xe
mailing list