[PATCH] drm/xe: Unlink client during vm close

Upadhyay, Tejas tejas.upadhyay at intel.com
Fri Jul 19 05:08:42 UTC 2024



> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Thursday, July 18, 2024 9:28 PM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>
> Cc: intel-xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe: Unlink client during vm close
> 
> On Thu, Jul 18, 2024 at 06:47:52PM +0530, Tejas Upadhyay wrote:
> > We have an async call which does not know whether the client has been
> > unlinked from the vm by the time it is accessed. Unlink the client early
> > during xe_vm_close() so that the async path does not touch closed client info.
> >
> > Also, the debug output related to a job timeout is not useful when it is "no
> > process" or the client has already been unlinked.
> >
> 
> If a kernel exec queue's job times out, the 'Timedout job' message will now not
> be displayed, which is not ideal.
> 
> > Fixes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2273
> 
> Where exactly is this access coming from?
> BUG: kernel NULL pointer dereference, address: 0000000000000058

The crash comes from guc_exec_queue_timedout_job() accessing "q->vm->xef->drm" after the client has closed its fd. My thinking was that we can't take a reference and keep the client around until the job timeout fires.
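
For reference, a minimal stand-alone sketch of the failure mode and of the
guard this patch relies on. The types and functions below are simplified
placeholders, not the actual xe driver code:

	#include <stddef.h>
	#include <stdio.h>

	struct xe_file  { const char *process_name; int pid; };
	struct xe_vm    { struct xe_file *xef; };
	struct xe_queue { struct xe_vm *vm; };

	/* Async timeout handler: may run after the client has closed its fd. */
	static void timedout_job(struct xe_queue *q)
	{
		struct xe_file *xef = q->vm ? q->vm->xef : NULL;

		if (xef)	/* without this check, q->vm->xef->... oopses once xef is gone */
			printf("Timedout job in %s [%d]\n", xef->process_name, xef->pid);
		else
			printf("Timedout job, client already unlinked\n");
	}

	/* VM close path: once the client is unlinked, later xef dereferences are unsafe. */
	static void vm_close(struct xe_vm *vm)
	{
		vm->xef = NULL;
	}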

> 
> Also btw, the correct tag for a gitlab link is 'Closes'; 'Fixes' points at the
> offending kernel patch so the fix can be pulled into stable kernels.

Ok

> 
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_guc_submit.c | 7 ++++---
> >  drivers/gpu/drm/xe/xe_vm.c         | 1 +
> >  2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 860405527115..1de141cb84c6 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1166,10 +1166,11 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  			process_name = task->comm;
> >  			pid = task->pid;
> >  		}
> > +		xe_gt_notice(guc_to_gt(guc), "Timedout job: seqno=%u, lrc_seqno=%u, guc_id=%d, flags=0x%lx in %s [%d]",
> > +			     xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> > +			     q->guc->id, q->flags, process_name, pid);
> >  	}
> > -	xe_gt_notice(guc_to_gt(guc), "Timedout job: seqno=%u, lrc_seqno=%u, guc_id=%d, flags=0x%lx in %s [%d]",
> > -		     xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> > -		     q->guc->id, q->flags, process_name, pid);
> > +
> >  	if (task)
> >  		put_task_struct(task);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index cf3aea5d8cdc..660b20e0e207 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -1537,6 +1537,7 @@ static void xe_vm_close(struct xe_vm *vm)  {
> >  	down_write(&vm->lock);
> >  	vm->size = 0;
> > +	vm->xef = NULL;
> 
> This doesn't appear to be thread safe.

Would you please elaborate?
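
Is the concern something like the sketch below, i.e. that the timeout handler
reads vm->xef without holding vm->lock, so clearing it under the write lock
does not synchronize with that read? (Simplified stand-alone illustration with
placeholder types, not the real driver locking code.)

	#include <pthread.h>
	#include <stddef.h>

	struct xe_file { int unused; };
	struct xe_vm {
		pthread_rwlock_t lock;
		struct xe_file *xef;
	};

	/* Writer: xe_vm_close() clears the pointer under the write lock. */
	static void vm_close(struct xe_vm *vm)
	{
		pthread_rwlock_wrlock(&vm->lock);
		vm->xef = NULL;
		pthread_rwlock_unlock(&vm->lock);
	}

	/*
	 * Reader: the timeout handler does not take vm->lock, so the locked
	 * write above does not order against this load; it may still observe
	 * and then dereference a stale xef.
	 */
	static struct xe_file *timedout_job_read(struct xe_vm *vm)
	{
		return vm->xef;		/* unsynchronized read, racy with vm_close() */
	}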

Thanks,
Tejas
> 
> Matt
> 
> >  	up_write(&vm->lock);
> >  }
> >
> > --
> > 2.25.1
> >

