[igt-dev] [Intel-xe] [PATCH i-g-t] tests/xe/xe_intel_bb: ensure valid next page

Wed Jun 7 10:27:41 UTC 2023

On 07/06/2023 10:55, Zbigniew Kempczyński wrote:
> On Fri, Jun 02, 2023 at 12:48:17PM +0100, Matthew Auld wrote:
>> Due to over-fetch, recommendation is to ensure we have a single valid
>> extra page beyond the batch. We currently lack this which seems to
>> explain why xe_intel_bb at full-batch generates CAT errors.
>>
>> Currently we allow using the last GTT page, but this looks to be no-go,
>> since the next page will be beyond the actual GTT, in the case of
>> full-batch. The i915 path looks to already account for this. However
>> even with that fixed, Xe doesn't use scratch pages by default so the
>> next page will still not be valid.
>>
>> With Xe rather expect that callers know about HW over-fetch, ensuring
>> that the batch has an extra page, if needed. Alternatively we could
>> apply the DRM_XE_VM_CREATE_SCRATCH_PAGE when creating the vm, but really
>> we want to get away from such things.
>>
>> Bspec: 60223
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/262
>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> 
> I observe this not only on the last page. I've introduced
> 
> index 3cd680072e..ef3f782df7 100644
> --- a/lib/intel_batchbuffer.c
> +++ b/lib/intel_batchbuffer.c
> @@ -947,7 +947,7 @@ __intel_bb_create(int fd, uint32_t ctx, uint32_t vm, const intel_ctx_cfg_t *cfg,
>   
>                  /* Limit to 48-bit due to MI_* address limitation */
>                  ibb->gtt_size = 1ull << min_t(uint32_t, xe_va_bits(fd), 48);
> -               end = ibb->gtt_size;
> +               end = ibb->gtt_size - xe_get_default_alignment(fd)*2;
> 
> or even:
> 
> +               end = ibb->gtt_size/2;
> 
> Having:
> 
> ADLP:~/igt-upstream/build/tests# ./xe_intel_bb --run full-batch --debug
> IGT-Version: 1.27.1-NO-GIT (x86_64) (Linux: 6.3.0-xe+ x86_64)
> Opened device: /dev/dri/card0
> (xe_intel_bb:11420) drmtest-DEBUG: Test requirement passed: !(fd<0)
> (xe_intel_bb:11420) intel_bufops-DEBUG: generation: 12, supported tiles: 0x3f, driver: xe
> Starting subtest: full-batch
> (xe_intel_bb:11420) intel_allocator_simple-DEBUG: Using simple allocator
> (xe_intel_bb:11420) intel_batchbuffer-DEBUG: Run on DRM_XE_ENGINE_CLASS_COPY
> (xe_intel_bb:11420) intel_batchbuffer-DEBUG: bind: MAP
> (xe_intel_bb:11420) intel_batchbuffer-DEBUG:   handle: 1, offset: ffffffffd000, size: 1000
> (xe_intel_bb:11420) intel_batchbuffer-DEBUG: bind: UNMAP
> (xe_intel_bb:11420) intel_batchbuffer-DEBUG:   offset: ffffffffd000, size: 1000
> 
> And I still observe failure, what means problem is with prefetching from
> next page, regardless location.

Yeah, AFAICT the CAT error will trigger for any overfetch that hits an 
bogus GTT address. Either because it goes beyond the GTT boundary or 
because the PTE is marked as invalid and we aren't able to fault it in it.

> 
> That means imo umd and any consumer have to overallocate to avoid
> hitting:
> 
> [ 1406.062731] xe 0000:00:02.0: [drm] Engine memory cat error: guc_id=2
> [ 1406.070519] xe 0000:00:02.0: [drm] Timedout job: seqno=4294967169, guc_id=2, flags=0x4
> [ 1406.070535] xe 0000:00:02.0: [drm:xe_devcoredump [xe]] Multiple hangs are occurring, but only the first snapshot was taken
> [ 1406.071182] xe 0000:00:02.0: [drm] Engine reset: guc_id=2
> 
> So from my perspective you've avoided failure but that wasn't my intention
> when I've written this test. Do you maybe know does umd has mitigation
> code which prevents of entering area which starts prefetching from next
> page?

Userspace should be aware of such HW behaviour I think. In Mesa + Xe I 
think it still currently forces scratch pages, but AFAIK they want to 
get rid of that once they are sure that all potential overfetch can't 
trigger CAT errors or similar. If the vm doesn't have scratch pages, 
then I think it's up to userspace to deal with overfetch, and I think 
that must involve inflating the batch size or somehow ensuring the next 
page always has something valid bound.

Do you know if this test is specifically trying to poke at the overfetch 
behaviour, or does this test just want a PAGE_SIZE worth of NOOPS + 
BB_END? I wasn't sure given that currently on DG2/ATS-M the batch is 
always going to be 64K underneath, and so the overfetch is never going 
to find issues on such platforms.

If the test cares specifically about triggering overfetch we could maybe 
enable scratch pages for the vm in full-batch, and on DG2/ATS-M also do 
s/PAGE_SIZE/64K/ or use system memory for the batch. And then ofc ensure 
the last GTT page is never given out. If we later decide to remove 
scratch pages completely in Xe then I guess we just remove the test. 
What do you think?

> 
> --
> Zbigniew
> 
> 
>> ---
>>   lib/intel_batchbuffer.c | 6 ++++++
>>   tests/xe/xe_intel_bb.c  | 8 +++++++-
>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/intel_batchbuffer.c b/lib/intel_batchbuffer.c
>> index 3cd680072..035facfc4 100644
>> --- a/lib/intel_batchbuffer.c
>> +++ b/lib/intel_batchbuffer.c
>> @@ -881,6 +881,12 @@ static inline uint64_t __intel_bb_get_offset(struct intel_bb *ibb,
>>    * passed in. If this is the case, it copies the information over to the
>>    * newly created batch buffer.
>>    *
>> + * NOTE: On Xe scratch pages are not used by default. Due to over-fetch (~512
>> + * bytes) there might need to be a valid next page to avoid hangs or CAT errors
>> + * if the batch is quite large and approaches the end boundary of the batch
>> + * itself.  Inflate the @size to ensure there is a valid next page in such
>> + * cases.
>> + *
>>    * Returns:
>>    *
>>    * Pointer the intel_bb, asserts on failure.
>> diff --git a/tests/xe/xe_intel_bb.c b/tests/xe/xe_intel_bb.c
>> index 755cc530e..af8462af5 100644
>> --- a/tests/xe/xe_intel_bb.c
>> +++ b/tests/xe/xe_intel_bb.c
>> @@ -952,7 +952,13 @@ static void full_batch(struct buf_ops *bops)
>>   	struct intel_bb *ibb;
>>   	int i;
>>   
>> -	ibb = intel_bb_create(xe, PAGE_SIZE);
>> +	/*
>> +	 * Add an extra page to ensure over-fetch always sees a valid next page,
>> +	 * which includes not going beyond the actual GTT, and ensuring we have
>> +	 * a valid GTT entry, given that on xe we don't use scratch pages by
>> +	 * default.
>> +	 */
>> +	ibb = intel_bb_create(xe, 2 * PAGE_SIZE);
>>   	if (debug_bb)
>>   		intel_bb_set_debug(ibb, true);
>>   
>> -- 
>> 2.40.1
>>