[PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform

Thu Feb 6 15:14:36 UTC 2025

> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> Sent: February 6, 2025 4:30 AM
> To: Zeng, Oak <oak.zeng at intel.com>; intel-xe at lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost at intel.com>; Cavitt, Jonathan
> <jonathan.cavitt at intel.com>
> Subject: Re: [PATCH 3/3] drm/xe: Allow scratch page under fault
> mode for certain platform
> 
> On Thu, 2025-02-06 at 01:54 +0000, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > Sent: February 5, 2025 8:14 AM
> > > To: Zeng, Oak <oak.zeng at intel.com>; intel-
> xe at lists.freedesktop.org
> > > Cc: Brost, Matthew <matthew.brost at intel.com>; Cavitt, Jonathan
> > > <jonathan.cavitt at intel.com>
> > > Subject: Re: [PATCH 3/3] drm/xe: Allow scratch page under fault
> > > mode for certain platform
> > >
> > > On Tue, 2025-02-04 at 13:45 -0500, Oak Zeng wrote:
> > > > Normally scratch page is not allowed when a vm is operate
> under
> > > page
> > > > fault mode, i.e., in the existing codes,
> > > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE
> > > > and DRM_XE_VM_CREATE_FLAG_FAULT_MODE are mutual
> > > exclusive. The reason
> > > > is fault mode relies on recoverable page to work, while scratch
> > > > page
> > > > can mute recoverable page fault.
> > > >
> > > > On xe2 and xe3, out of bound prefetch can cause page fault and
> > > > further
> > > > system hang because xekmd can't resolve such page fault. SYCL
> and
> > > OCL
> > > > language runtime requires out of bound prefetch to be silently
> > > > dropped
> > > > without causing any functional problem, thus the existing
> > > > behavior
> > > > doesn't meet language runtime requirement.
> > > >
> > > > At the same time, HW prefetching can cause page fault interrupt.
> > > Due
> > > > to
> > > > page fault interrupt overhead (i.e., need Guc and KMD involved
> to
> > > fix
> > > > the page fault), HW prefetching can be slowed by many orders
> of
> > > > magnitude.
> > > >
> > > > Fix those problems by allowing scratch page under fault mode
> for
> > > xe2
> > > > and
> > > > xe3. With scratch page in place, HW prefetching could always hit
> > > > scratch
> > > > page instead of causing interrupt.
> > > >
> > > > A side effect is, scratch page could hide application program
> > > > error.
> > > > Application out of bound accesses are hided
> > > s/hided/hidden/
> > >
> > > >  by scratch page mapping,
> > > > instead of get reported to user.
> > > >
> > > > igt test: https://patchwork.freedesktop.org/series/144334/.
> Test
> > > > result on
> > > > BMG:
> > > >
> > > > root at DUT1130BMGFRD:/home/szeng/dii-tools/igt-
> > > public/build/tests#
> > > > ./xe_exec_fault_mode --run-subtest scratch-fault
> > > > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > > > Using IGT_SRANDOM=1738684805 for randomisation
> > > > Opened device: /dev/dri/card0
> > > > Starting subtest: scratch-fault
> > > > Subtest scratch-fault: SUCCESS (0.080s)
> > > >
> > > > Without this series, the test result is:
> > > >
> > > > root at DUT1130BMGFRD:/home/szeng/dii-tools/igt-
> > > public/build/tests#
> > > > ./xe_exec_fault_mode --run-subtest scratch-fault
> > > > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > > > Using IGT_SRANDOM=1738686046 for randomisation
> > > > Opened device: /dev/dri/card0
> > > > Starting subtest: scratch-fault
> > > > (xe_exec_fault_mode:5047) CRITICAL: Test assertion failure
> > > function
> > > > test_exec, file ../tests/intel/xe_exec_fault_mode.c:349:
> > > > (xe_exec_fault_mode:5047) CRITICAL: Failed assertion:
> > > > __xe_wait_ufence(fd, &exec_sync[i], 0xdeadbeefdeadbeefull,
> > > > exec_queues[i % n_exec_queues], &timeout) == 0
> > > > (xe_exec_fault_mode:5047) CRITICAL: Last errno: 62, Timer
> expired
> > > > (xe_exec_fault_mode:5047) CRITICAL: error: -62 != 0
> > > > Stack trace:
> > > >   #0 ../lib/igt_core.c:2266 __igt_fail_assert()
> > > >   #1 ../tests/intel/xe_exec_fault_mode.c:346 test_exec()
> > > >   #2 ../tests/intel/xe_exec_fault_mode.c:537
> > > > __igt_unique____real_main407()
> > > >   #3 ../tests/intel/xe_exec_fault_mode.c:407 main()
> > > >   #4 ../sysdeps/nptl/libc_start_call_main.h:74
> > > > __libc_start_call_main()
> > > >   #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
> > > >   #6 [_start+0x2e]
> > > > Subtest scratch-fault failed.
> > > >
> > > > v2: Refine commit message (Thomas)
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_vm.c | 9 +++++----
> > > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > > b/drivers/gpu/drm/xe/xe_vm.c
> > > > index 813d893d9b63..c0372f083d42 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -1752,6 +1752,11 @@ int xe_vm_create_ioctl(struct
> > > drm_device *dev,
> > > > void *data,
> > > >  	if (XE_IOCTL_DBG(xe, args->extensions))
> > > >  		return -EINVAL;
> > > >
> > > > +	if (XE_IOCTL_DBG(xe, args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
> > > > +			 args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE &&
> > > > +			 !(NEEDS_SCRATCH(xe))))
> > > > +		return -EINVAL;
> > > > +
> > >
> > > We should probably move this test to where the old test were
> below,
> > > since the WA below enables scratch pages.
> >
> >
> > Below wa code is for dg2.
> > If we move this test below the wa code, then on dg2, if people
> create
> > vm
> > With FAULT_MODE | LR_MODE, above check would fail user.
> 
> Though on DG2 we don't support fault mode.
> 
> >
> > So I do think we should keep this check above wa codes.
> 
> I think what matters here is readability of the code. A naive reader
> (like me in this case) would start to wonder why enabling scratch
> pages
> *after* the sanity check for them is done, and whether that is a bug
> or
> an oversight, and that would require looking up what the WA actually
> means.
> 
> So IMO in general sanity checks should be done after all implicit
> enabling of flags if at all possible. And if not, they should be
> accompanied by a comment that explains why the enabling of the flag
> is
> not included in the sanity check.

Ok, make sense to me. Will move the check below the wa then.

Oak

> 
> /Thomas
> 
> 
> >
> > Oak
> >
> >
> > >
> > > /Thomas
> > >
> > >
> > > >  	if (XE_WA(xe_root_mmio_gt(xe), 14016763929))
> > > >  		args->flags |=
> > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE;
> > > >
> > > > @@ -1765,10 +1770,6 @@ int xe_vm_create_ioctl(struct
> > > drm_device *dev,
> > > > void *data,
> > > >  	if (XE_IOCTL_DBG(xe, args->flags &
> > > > ~ALL_DRM_XE_VM_CREATE_FLAGS))
> > > >  		return -EINVAL;
> > > >
> > > > -	if (XE_IOCTL_DBG(xe, args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
> > > > -			 args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE))
> > > > -		return -EINVAL;
> > > > -
> > > >  	if (XE_IOCTL_DBG(xe, !(args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_LR_MODE) &&
> > > >  			 args->flags &
> > > > DRM_XE_VM_CREATE_FLAG_FAULT_MODE))
> > > >  		return -EINVAL;
> >