[PATCH 3/3] drm/xe: Allow scratch page under fault mode for certain platform

Thu Feb 27 00:22:41 UTC 2025

On Wed, Feb 26, 2025 at 03:12:14PM -0700, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: February 25, 2025 5:11 PM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: intel-xe at lists.freedesktop.org;
> > Thomas.Hellstrom at linux.intel.com; Cavitt, Jonathan
> > <jonathan.cavitt at intel.com>
> > Subject: Re: [PATCH 3/3] drm/xe: Allow scratch page under fault
> > mode for certain platform
> > 
> > On Wed, Feb 12, 2025 at 09:23:31PM -0500, Oak Zeng wrote:
> > 
> > I replied to the wrong versions... Please generate the patches with:
> > 
> > git format-patch -v<n>
> > 
> > Where <n> is the version number. This will help avoid replying to the
> > wrong patch.
> > 
> > Copy / pasting my reply here...
> > 
> > > Normally a scratch page is not allowed when a VM operates under page
> > > fault mode, i.e., in the existing code,
> > > DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE and DRM_XE_VM_CREATE_FLAG_FAULT_MODE
> > > are mutually exclusive. The reason is that fault mode relies on
> > > recoverable page faults to work, while a scratch page can mute
> > > recoverable page faults.
> > >
> > > On xe2 and xe3, an out-of-bounds prefetch can cause a page fault and,
> > > in turn, a system hang, because xekmd can't resolve such a page fault.
> > > The SYCL and OCL language runtimes require out-of-bounds prefetches to
> > > be silently dropped without causing any functional problem, so the
> > > existing behavior doesn't meet the language runtime requirement.
> > >
> > > At the same time, HW prefetching can cause page fault interrupts. Due
> > > to page fault interrupt overhead (i.e., the GuC and KMD need to be
> > > involved to fix the page fault), HW prefetching can be slowed by many
> > > orders of magnitude.
> > >
> > > Fix those problems by allowing a scratch page under fault mode for xe2
> > > and xe3. With a scratch page in place, HW prefetching can always hit
> > > the scratch page instead of causing an interrupt.
> > >
> > > A side effect is that the scratch page could hide application program
> > > errors. Application out-of-bounds accesses are hidden by the scratch
> > > page mapping instead of being reported to the user.
> > >
> > 
> > I'd include the IGT information in the cover letter, not the patch.
> > 
> > > igt test: https://patchwork.freedesktop.org/series/144334/. Test result on BMG:
> > >
> > > root@DUT1130BMGFRD:/home/szeng/dii-tools/igt-public/build/tests# ./xe_exec_fault_mode --run-subtest scratch-fault
> > > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > > Using IGT_SRANDOM=1738684805 for randomisation
> > > Opened device: /dev/dri/card0
> > > Starting subtest: scratch-fault
> > > Subtest scratch-fault: SUCCESS (0.080s)
> > >
> > > Without this series, the test result is:
> > >
> > > root@DUT1130BMGFRD:/home/szeng/dii-tools/igt-public/build/tests# ./xe_exec_fault_mode --run-subtest scratch-fault
> > > IGT-Version: 1.30-gde1a3cb42 (x86_64) (Linux: 6.13.0-xe x86_64)
> > > Using IGT_SRANDOM=1738686046 for randomisation
> > > Opened device: /dev/dri/card0
> > > Starting subtest: scratch-fault
> > > (xe_exec_fault_mode:5047) CRITICAL: Test assertion failure function test_exec, file ../tests/intel/xe_exec_fault_mode.c:349:
> > > (xe_exec_fault_mode:5047) CRITICAL: Failed assertion: __xe_wait_ufence(fd, &exec_sync[i], 0xdeadbeefdeadbeefull, exec_queues[i % n_exec_queues], &timeout) == 0
> > > (xe_exec_fault_mode:5047) CRITICAL: Last errno: 62, Timer expired
> > > (xe_exec_fault_mode:5047) CRITICAL: error: -62 != 0
> > > Stack trace:
> > >   #0 ../lib/igt_core.c:2266 __igt_fail_assert()
> > >   #1 ../tests/intel/xe_exec_fault_mode.c:346 test_exec()
> > >   #2 ../tests/intel/xe_exec_fault_mode.c:537 __igt_unique____real_main407()
> > >   #3 ../tests/intel/xe_exec_fault_mode.c:407 main()
> > >   #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
> > >   #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
> > >   #6 [_start+0x2e]
> > > Subtest scratch-fault failed.
> > >
> > > v2: Refine commit message (Thomas)
> > >
> > > v3: Move the scratch page flag check to after scratch page wa (Thomas)
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_vm.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > index 813d893d9b63..c2dfd0ade403 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -1766,7 +1766,8 @@ int xe_vm_create_ioctl(struct drm_device *dev, void *data,
> > >  		return -EINVAL;
> > >
> > >  	if (XE_IOCTL_DBG(xe, args->flags & DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE &&
> > > -			 args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE))
> > > +			 args->flags & DRM_XE_VM_CREATE_FLAG_FAULT_MODE &&
> > > +			 !(NEEDS_SCRATCH(xe))))
> > 
> > Same comment as patch #1, I'd drop this macro.
> > 
> > Do we need a query uAPI so the UMD can test upon process open if the VM
> > supports scratch page + faults?
> 
> Introducing a query API seems a little overkill to me. 
> 

This is really a question for the level0 UMD, I suppose; I'd follow up
there. They have asked for a query uAPI for other seemingly simple things
before.

Matt

> Maybe I can add a short comment in xe_drm.h saying that the scratch and
> fault flags are normally mutually exclusive except on xe2/3?
> 
> 
> > Or should we just not restrict VM scratch page + faults ever and have it
> > choose based on platform recommendation?
> > 
> 
> I think keeping the above check/restriction is still helpful.
> 
> Oak
> 
> > Matt
> > 
> > >  		return -EINVAL;
> > >
> > >  	if (XE_IOCTL_DBG(xe, !(args->flags & DRM_XE_VM_CREATE_FLAG_LR_MODE) &&
> > > --
> > > 2.26.3
> > >
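
For anyone following the hunk quoted above, the flag check it changes can
be sketched in standalone C roughly as follows. This is only an
illustration under stated assumptions, not the driver code: the SKETCH_*
flag values, struct sketch_dev, its needs_scratch field, and
sketch_check_vm_flags() are all hypothetical names invented here. The real
flags are defined in xe_drm.h, and the real platform test is whatever
replaces NEEDS_SCRATCH(xe) in the final patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only; not the uAPI values. */
#define SKETCH_FLAG_SCRATCH_PAGE (1u << 0)
#define SKETCH_FLAG_FAULT_MODE   (1u << 1)

/* Hypothetical stand-in for the device; true models xe2/xe3 in the patch. */
struct sketch_dev {
	bool needs_scratch;
};

/*
 * Reject scratch page + fault mode together, unless the platform is one
 * where the combination is allowed. Returns 0 on success, -EINVAL otherwise.
 */
static int sketch_check_vm_flags(const struct sketch_dev *dev, uint32_t flags)
{
	bool scratch = (flags & SKETCH_FLAG_SCRATCH_PAGE) != 0;
	bool fault = (flags & SKETCH_FLAG_FAULT_MODE) != 0;

	if (scratch && fault && !dev->needs_scratch)
		return -EINVAL;

	return 0;
}
```

With this shape, a UMD without a query uAPI could only discover support by
attempting VM creation with both flags and handling -EINVAL, which is the
trade-off raised in the question above.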