[PATCH] drm/xe: Avoid evicting object of the same vm in none fault mode

Fri Dec 6 15:53:59 UTC 2024

On Fri, Dec 06, 2024 at 03:19:30PM +0000, Zeng, Oak wrote:
> Hi Rodrigo,
> 
> > -----Original Message-----
> > From: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> > Sent: December 6, 2024 9:46 AM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: intel-xe at lists.freedesktop.org;
> > Thomas.Hellstrom at linux.intel.com
> > Subject: Re: [PATCH] drm/xe: Avoid evicting object of the same vm in
> > none fault mode
> > 
> > On Mon, Dec 02, 2024 at 09:19:29PM -0500, Oak Zeng wrote:
> > > BO validation during vm_bind could trigger memory eviction when
> > > system runs under memory pressure. Right now we blindly evict
> > > BOs of all VMs. This scheme has a problem when system runs in
> > > none recoverable page fault mode: even though the vm_bind could
> > > be successful by evicting BOs, the later the rebinding of the
> > > evicted BOs would fail. So it is better to report an out-of-
> > > memory failure at vm_bind time than at time of rebinding where
> > > xekmd currently doesn't have a good mechanism to report error
> > > to user space.
> > >
> > > This patch implemented a scheme to only evict objects of other
> > > VMs during vm_bind time. Object of the same VM will skip eviction.
> > > If we failed to find enough memory for vm_bind, we report error
> > > to user space at vm_bind time.
> > >
> > > This scheme is not needed for recoverable page fault mode under
> > > what we can dynamically fault-in pages on demand.
> > >
> > > v1: Use xe_vm_in_preempt_fence_mode instead of stack variable
> > (Thomas)
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > Suggested-by: Thomas Hellström
> > <thomas.hellstrom at linux.intel.com>
> > > Reviewed-by: Thomas Hellström
> > <thomas.hellstrom at linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_vm.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_vm.c
> > b/drivers/gpu/drm/xe/xe_vm.c
> > > index 2492750505d69..016fedae5d554 100644
> > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > @@ -2359,13 +2359,15 @@ static int vma_lock_and_validate(struct
> > drm_exec *exec, struct xe_vma *vma,
> > >  				 bool validate)
> > >  {
> > >  	struct xe_bo *bo = xe_vma_bo(vma);
> > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > 
> > for single usage like below we should avoid variable declaration,
> > specially in
> > this case where you do have line space left there... then it gets even
> > clear
> > that the only change in the patch is
> > 
> > - true
> > + !xe_vm_in_preempt_fence_mode(vm)
> 
> Are you suggesting to remove above line "+	struct xe_vm *vm = xe_vma_vm(vma);"?
> 
> Vm variable is used twice below, not single.
> 
> If I remove vm variable, then below I should write:
> 
> Xe_bo_validate(bo, xe_vma_vm(vma), !xe_vm_in_preempt_fence_mode(xe_vma_vm(vma)));

although I liked the 3 usages case like Matt Roper was consistently trying to
apply, when looking to this line like this, I do prefer your original version...

Please ignore me... I'm merging this patch as is right now...

Sorry and Thanks

> 
> Is this what you are suggesting?
> 
> Oak
> 
> > 
> > >  	int err = 0;
> > >
> > >  	if (bo) {
> > >  		if (!bo->vm)
> > >  			err = drm_exec_lock_obj(exec, &bo-
> > >ttm.base);
> > >  		if (!err && validate)
> > > -			err = xe_bo_validate(bo, xe_vma_vm(vma),
> > true);
> > > +			err = xe_bo_validate(bo, vm,
> > > +
> > 	     !xe_vm_in_preempt_fence_mode(vm));
> > >  	}
> > >
> > >  	return err;
> > > --
> > > 2.26.3
> > >