[Intel-xe] [PATCH] drm/xe: fix tlb_invalidation_seqno_past()

Matthew Auld matthew.auld at intel.com
Wed May 10 10:10:08 UTC 2023


On 08/05/2023 04:42, Christopher Snowhill wrote:
> Why is XE_FENCE_INITIAL_SEQNO defined to (-127) when seqno variable
> types are mostly unsigned everywhere, except for logic that is
> checking for wraps?

It's just so that we are more likely to trigger a u32 seqno wrap, which
might uncover issues during normal CI runs. (u32)(-127) means we should
wrap after 127 seqnos or so.
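
To make the arithmetic concrete, here is a minimal sketch (not driver code). INITIAL_SEQNO mirrors the -127 value quoted above, and steps_until_wrap() is a hypothetical helper made up for illustration. The cast gives 4294967169, the same seqno that shows up in the "Timedout job" log line further down the thread.

```c
#include <stdint.h>

/* Value quoted in this thread for XE_FENCE_INITIAL_SEQNO. */
#define INITIAL_SEQNO (-127)

/*
 * Hypothetical helper, for illustration only: count how many increments
 * it takes for a u32 seqno to wrap around to 0.
 */
static uint32_t steps_until_wrap(uint32_t seqno)
{
	uint32_t steps = 0;

	while (seqno != 0) {
		seqno++;	/* unsigned overflow is well defined: modulo 2^32 */
		steps++;
	}

	return steps;
}
```

Calling steps_until_wrap((uint32_t)INITIAL_SEQNO) starts at 2^32 - 127 and reaches 0 after 127 increments, so any wrap-handling bug should surface very early in a run.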

> 
> On Sun, May 7, 2023 at 8:26 PM Christopher Snowhill <kode54 at gmail.com> wrote:
>>
>> False alarm. Looks like there's a different seqno overflow somewhere in here:
>>
>> [  854.230925] xe 0000:28:00.0: [drm] Engine reset: guc_id=59
>> [  854.236564] xe 0000:28:00.0: [drm] Timedout job: seqno=4294967169,
>> guc_id=59, flags=0x8
>> [  859.885349] xe 0000:28:00.0: [drm] Ioctl argument check failed at
>> drivers/gpu/drm/xe/xe_exec.c:178: engine->flags & ENGINE_FLAG_BANNED
>> [  859.885366] xe 0000:28:00.0: [drm] Ioctl argument check failed at
>> drivers/gpu/drm/xe/xe_exec.c:178: engine->flags & ENGINE_FLAG_BANNED
>>
>> Also, curses to gmail for not defaulting to plain text mode.
>>
>>
>> On Sun, May 7, 2023 at 8:19 PM Christopher Snowhill <kode54 at gmail.com> wrote:
>>>
>>> Wow, this patch made intel-compute-runtime suddenly start working properly instead of causing a "GPU hang" that wasn't really a hang but instead a seqno overflow.
>>>
>>> On Sun, May 7, 2023 at 5:53 PM Matthew Brost <matthew.brost at intel.com> wrote:
>>>>
>>>> On Fri, May 05, 2023 at 03:49:10PM +0100, Matthew Auld wrote:
>>>>> Checking seqno_recv >= seqno looks like it will incorrectly report true
>>>>> when the seqno has wrapped (not unlikely given
>>>>> TLB_INVALIDATION_SEQNO_MAX). Calling xe_gt_tlb_invalidation_wait() might
>>>>> then return before the flush has been completed by the GuC.
>>>>>
>>>>> Fix this by treating a large negative delta as an indication that the
>>>>> seqno has wrapped around. Similar to how we treat a large positive delta
>>>>> as an indication that the seqno_recv must have wrapped around, but in
>>>>> that case the seqno has likely also signalled.
>>>>>
>>>>> It looks like we could also potentially make the seqno use the full
>>>>> 32bits as supported by the GuC.
>>>>
>>>> Yea we def could use more of the space but in the end we have the seqno
>>>> wrap issue. I think I set this to a low value to prove the wrapping
>>>> protection worked (it didn't) by triggering wraps more often than the
>>>> wrapping 32 bits.
>>>>
>>>> With that, this patch LGTM.
>>>>
>>>> Reviewed-by: Matthew Brost <matthew.brost at intel.com>
>>>>
>>>>>
>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 7 ++++---
>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>>>>> index 604f189dbd70..67822b3dd353 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>>>>> @@ -251,14 +251,15 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
>>>>>
>>>>>   static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int seqno)
>>>>>   {
>>>>> -     if (gt->tlb_invalidation.seqno_recv >= seqno)
>>>>> -             return true;
>>>>> +     if (seqno - gt->tlb_invalidation.seqno_recv <
>>>>> +         -(TLB_INVALIDATION_SEQNO_MAX / 2))
>>>>> +             return false;
>>>>>
>>>>>        if (seqno - gt->tlb_invalidation.seqno_recv >
>>>>>            (TLB_INVALIDATION_SEQNO_MAX / 2))
>>>>>                return true;
>>>>>
>>>>> -     return false;
>>>>> +     return gt->tlb_invalidation.seqno_recv >= seqno;
>>>>>   }
>>>>>
>>>>>   /**
>>>>> --
>>>>> 2.40.0
>>>>>
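
For anyone following along, the before/after behaviour of the check in the quoted diff can be sketched as a standalone comparison. This is only an illustration under stated assumptions: SEQNO_MAX stands in for TLB_INVALIDATION_SEQNO_MAX with an assumed value, seqno_recv is passed as a plain int instead of being read from the gt struct, and the function names seqno_past_old() and seqno_past_new() are made up here.

```c
#include <stdbool.h>

/* Assumed stand-in for TLB_INVALIDATION_SEQNO_MAX; the value is illustrative. */
#define SEQNO_MAX 0x100000

/* Logic before the patch: a wrapped seqno looks "already past". */
static bool seqno_past_old(int seqno_recv, int seqno)
{
	if (seqno_recv >= seqno)
		return true;

	/* Large positive delta: seqno_recv has wrapped around. */
	if (seqno - seqno_recv > (SEQNO_MAX / 2))
		return true;

	return false;
}

/* Logic after the patch: a large negative delta means seqno has wrapped. */
static bool seqno_past_new(int seqno_recv, int seqno)
{
	/* seqno wrapped, seqno_recv has not caught up yet: not past. */
	if (seqno - seqno_recv < -(SEQNO_MAX / 2))
		return false;

	/* seqno_recv wrapped, so seqno has most likely signalled. */
	if (seqno - seqno_recv > (SEQNO_MAX / 2))
		return true;

	return seqno_recv >= seqno;
}
```

With seqno_recv = SEQNO_MAX - 2 and a freshly wrapped seqno = 3, the old check returns true (so a wait could return too early), while the new check returns false. The non-wrapped cases and the case where only seqno_recv has wrapped give the same answer in both versions.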

