[igt-dev] [Intel-xe] [PATCH i-g-t] tests/xe/xe_intel_bb: ensure valid next page

Wed Jun 7 09:55:08 UTC 2023

On Fri, Jun 02, 2023 at 12:48:17PM +0100, Matthew Auld wrote:
> Due to over-fetch, recommendation is to ensure we have a single valid
> extra page beyond the batch. We currently lack this which seems to
> explain why xe_intel_bb at full-batch generates CAT errors.
> 
> Currently we allow using the last GTT page, but this looks to be no-go,
> since the next page will be beyond the actual GTT, in the case of
> full-batch. The i915 path looks to already account for this. However
> even with that fixed, Xe doesn't use scratch pages by default so the
> next page will still not be valid.
> 
> With Xe rather expect that callers know about HW over-fetch, ensuring
> that the batch has an extra page, if needed. Alternatively we could
> apply the DRM_XE_VM_CREATE_SCRATCH_PAGE when creating the vm, but really
> we want to get away from such things.
> 
> Bspec: 60223
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/262
> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>

I observe this not only on the last page. I've introduced

index 3cd680072e..ef3f782df7 100644

--- a/lib/intel_batchbuffer.c
+++ b/lib/intel_batchbuffer.c
@@ -947,7 +947,7 @@ __intel_bb_create(int fd, uint32_t ctx, uint32_t vm, const intel_ctx_cfg_t *cfg,
 
                /* Limit to 48-bit due to MI_* address limitation */
                ibb->gtt_size = 1ull << min_t(uint32_t, xe_va_bits(fd), 48);
-               end = ibb->gtt_size;
+               end = ibb->gtt_size - xe_get_default_alignment(fd)*2;

or even:

+               end = ibb->gtt_size/2;

Having:

ADLP:~/igt-upstream/build/tests# ./xe_intel_bb --run full-batch --debug
IGT-Version: 1.27.1-NO-GIT (x86_64) (Linux: 6.3.0-xe+ x86_64)
Opened device: /dev/dri/card0
(xe_intel_bb:11420) drmtest-DEBUG: Test requirement passed: !(fd<0)
(xe_intel_bb:11420) intel_bufops-DEBUG: generation: 12, supported tiles: 0x3f, driver: xe
Starting subtest: full-batch
(xe_intel_bb:11420) intel_allocator_simple-DEBUG: Using simple allocator
(xe_intel_bb:11420) intel_batchbuffer-DEBUG: Run on DRM_XE_ENGINE_CLASS_COPY
(xe_intel_bb:11420) intel_batchbuffer-DEBUG: bind: MAP
(xe_intel_bb:11420) intel_batchbuffer-DEBUG:   handle: 1, offset: ffffffffd000, size: 1000
(xe_intel_bb:11420) intel_batchbuffer-DEBUG: bind: UNMAP
(xe_intel_bb:11420) intel_batchbuffer-DEBUG:   offset: ffffffffd000, size: 1000

And I still observe failure, what means problem is with prefetching from
next page, regardless location.

That means imo umd and any consumer have to overallocate to avoid
hitting:

[ 1406.062731] xe 0000:00:02.0: [drm] Engine memory cat error: guc_id=2
[ 1406.070519] xe 0000:00:02.0: [drm] Timedout job: seqno=4294967169, guc_id=2, flags=0x4
[ 1406.070535] xe 0000:00:02.0: [drm:xe_devcoredump [xe]] Multiple hangs are occurring, but only the first snapshot was taken
[ 1406.071182] xe 0000:00:02.0: [drm] Engine reset: guc_id=2

So from my perspective you've avoided failure but that wasn't my intention
when I've written this test. Do you maybe know does umd has mitigation
code which prevents of entering area which starts prefetching from next
page?

--
Zbigniew


> ---
>  lib/intel_batchbuffer.c | 6 ++++++
>  tests/xe/xe_intel_bb.c  | 8 +++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/intel_batchbuffer.c b/lib/intel_batchbuffer.c
> index 3cd680072..035facfc4 100644
> --- a/lib/intel_batchbuffer.c
> +++ b/lib/intel_batchbuffer.c
> @@ -881,6 +881,12 @@ static inline uint64_t __intel_bb_get_offset(struct intel_bb *ibb,
>   * passed in. If this is the case, it copies the information over to the
>   * newly created batch buffer.
>   *
> + * NOTE: On Xe scratch pages are not used by default. Due to over-fetch (~512
> + * bytes) there might need to be a valid next page to avoid hangs or CAT errors
> + * if the batch is quite large and approaches the end boundary of the batch
> + * itself.  Inflate the @size to ensure there is a valid next page in such
> + * cases.
> + *
>   * Returns:
>   *
>   * Pointer the intel_bb, asserts on failure.
> diff --git a/tests/xe/xe_intel_bb.c b/tests/xe/xe_intel_bb.c
> index 755cc530e..af8462af5 100644
> --- a/tests/xe/xe_intel_bb.c
> +++ b/tests/xe/xe_intel_bb.c
> @@ -952,7 +952,13 @@ static void full_batch(struct buf_ops *bops)
>  	struct intel_bb *ibb;
>  	int i;
>  
> -	ibb = intel_bb_create(xe, PAGE_SIZE);
> +	/*
> +	 * Add an extra page to ensure over-fetch always sees a valid next page,
> +	 * which includes not going beyond the actual GTT, and ensuring we have
> +	 * a valid GTT entry, given that on xe we don't use scratch pages by
> +	 * default.
> +	 */
> +	ibb = intel_bb_create(xe, 2 * PAGE_SIZE);
>  	if (debug_bb)
>  		intel_bb_set_debug(ibb, true);
>  
> -- 
> 2.40.1
>