[PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform
Zeng, Oak
oak.zeng at intel.com
Thu Feb 6 01:54:49 UTC 2025
> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> Sent: February 5, 2025 8:14 AM
> To: Zeng, Oak <oak.zeng at intel.com>; intel-xe at lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost at intel.com>; Cavitt, Jonathan
> <jonathan.cavitt at intel.com>
> Subject: Re: [PATCH 3/3] drm/xe: Allow scratch page under fault
> mode for certain platform
>
> On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote:
> > Normally scratch page is not allowed when a vm is operate under
> page
> > fault mode, i.e., in the existing codes,
> > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE
> > and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual
> exclusive. The reason
> > is fault mode relies on recoverable page to work, while scratch page
> > can mute recoverable page fault.
> >
> > On xe2 and xe3, out of bound prefetch can cause page fault and
> > further
> > system hang because xekmd can't resolve such page fault. SYCL and
> OCL
> > language runtime requires out of bound prefetch to be silently
> > dropped
> > without causing any functional problem, thus the existing behavior
> > doesn't meet language runtime requirement.
> >
> > At the same time, HW prefetching can cause page fault interrupt.
> Due
> > to
> > page fault interrupt overhead (i.e., need Guc and KMD involved to
> fix
> > the page fault), HW prefetching can be slowed by many orders of
> > magnitude.
> >
> > Fix those problems by allowing scratch page under fault mode for
> xe2
> > and
> > xe3. With scratch page in place, HW prefetching could always hit
> > scratch
> > page instead of causing interrupt.
> >
> > A side effect is, scratch page could hide application program error.
> > Application out of bound accesses are hided
> s/hided/hidden/
>
> > by scratch page mapping,
> > instead of get reported to user.
> >
> > igt test: https://patchwork.freedesktop.org/series/144334/. Test
> > result on
> > BMG:
> >
> > root at DUT1130BMGFRD:/home/szeng/dii-tools/igt-
> public/build/tests#
> > ./xe_exec_fault_mode --run-subtest scratch-fault
> > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > Using IGT_SRANDOM=1738684805 for randomisation
> > Opened device: /dev/dri/card0
> > Starting subtest: scratch-fault
> > Subtest scratch-fault: SUCCESS (0.080s)
> >
> > Without this series, the test result is:
> >
> > root at DUT1130BMGFRD:/home/szeng/dii-tools/igt-
> public/build/tests#
> > ./xe_exec_fault_mode --run-subtest scratch-fault
> > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > Using IGT_SRANDOM=1738686046 for randomisation
> > Opened device: /dev/dri/card0
> > Starting subtest: scratch-fault
> > (xe_exec_fault_mode:5047) CRITICAL: Test assertion failure
> function
> > test_exec, file ../tests/intel/xe_exec_fault_mode.c:349:
> > (xe_exec_fault_mode:5047) CRITICAL: Failed assertion:
> > __xe_wait_ufence(fd, &exec_sync[i], 0xdeadbeefdeadbeefull,
> > exec_queues[i % n_exec_queues], &timeout) == 0
> > (xe_exec_fault_mode:5047) CRITICAL: Last errno: 62, Timer expired
> > (xe_exec_fault_mode:5047) CRITICAL: error: -62 != 0
> > Stack trace:
> > #0 ../lib/igt_core.c:2266 __igt_fail_assert()
> > #1 ../tests/intel/xe_exec_fault_mode.c:346 test_exec()
> > #2 ../tests/intel/xe_exec_fault_mode.c:537
> > __igt_unique____real_main407()
> > #3 ../tests/intel/xe_exec_fault_mode.c:407 main()
> > #4 ../sysdeps/nptl/libc_start_call_main.h:74
> > __libc_start_call_main()
> > #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
> > #6 [_start+0x2e]
> > Subtest scratch-fault failed.
> >
> > v2: Refine commit message (Thomas)
> >
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_vm.c | 9 +++++----
> > 1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> b/drivers/gpu/drm/xe/xe_vm.c
> > index 813d893d9b63..c0372f083d42 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -1752,6 +1752,11 @@ int xe_vm_create_ioctl(struct
> drm_device *dev,
> > void *data,
> > if (XE_IOCTL_DBG(xe, args->extensions))
> > return -EINVAL;
> >
> > + if (XE_IOCTL_DBG(xe, args->flags &
> > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
> > + args->flags &
> > DRM_XE_VM_CREATE_FLAG_FAULT_MODE &&
> > + !(NEEDS_SCRATCH(xe))))
> > + return -EINVAL;
> > +
>
> We should probably move this test to where the old test were below,
> since the WA below enables scratch pages.
Below wa code is for dg2.
If we move this test below the wa code, then on dg2, if people create vm
With FAULT_MODE | LR_MODE, above check would fail user.
So I do think we should keep this check above wa codes.
Oak
>
> /Thomas
>
>
> > if (XE_WA(xe_root_mmio_gt(xe), 14016763929))
> > args->flags |=
> DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE;
> >
> > @@ -1765,10 +1770,6 @@ int xe_vm_create_ioctl(struct
> drm_device *dev,
> > void *data,
> > if (XE_IOCTL_DBG(xe, args->flags &
> > ~ALL_DRM_XE_VM_CREATE_FLAGS))
> > return -EINVAL;
> >
> > - if (XE_IOCTL_DBG(xe, args->flags &
> > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
> > - args->flags &
> > DRM_XE_VM_CREATE_FLAG_FAULT_MODE))
> > - return -EINVAL;
> > -
> > if (XE_IOCTL_DBG(xe, !(args->flags &
> > DRM_XE_VM_CREATE_FLAG_LR_MODE) &&
> > args->flags &
> > DRM_XE_VM_CREATE_FLAG_FAULT_MODE))
> > return -EINVAL;
More information about the Intel-xe
mailing list