[PATCH v3] drm/i915/gvt: add a spin_lock to protect the current_workload

Thu Feb 16 12:54:09 UTC 2017

> -----Original Message-----
> From: Zhenyu Wang [mailto:zhenyuw at linux.intel.com]
> Sent: Thursday, February 16, 2017 6:43 PM
> To: Dong, Chuanxiao <chuanxiao.dong at intel.com>
> Cc: intel-gvt-dev at lists.freedesktop.org
> Subject: Re: [PATCH v3] drm/i915/gvt: add a spin_lock to protect the
> current_workload
> 
> On 2017.02.16 16:43:27 +0800, Chuanxiao Dong wrote:
> > There is a corner case which can cause kernel panic after a GPU reset.
> > The reason for this kernel panic is that, sometimes there is still a
> > context switch comes from HW to notify SW for a failed workload. But
> > as it is a failed workload, GVT has already freed this workload and
> > the current_workload in scheduler is NULL. Accessing NULL caused
> > kernel panic. So this issue is caused by unsynchronized between the
> > workload free and unexpected notification handling.
> >
> > To protect the current_workload, add a spin_lock. So it can provide a
> > chance for SW to synchronize with each other to avoid such kind of
> > conflict.
> >
> > v2: update the commit message;
> >     add EINPROGRESS check for workload->status
> >
> > v3: make the status check gracefully
> >
> > Signed-off-by: Chuanxiao Dong <chuanxiao.dong at intel.com>
> > ---
> >  drivers/gpu/drm/i915/gvt/scheduler.c | 30
> > ++++++++++++++++++++++++++++--  drivers/gpu/drm/i915/gvt/scheduler.h
> |
> > 2 ++
> >  2 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c
> > b/drivers/gpu/drm/i915/gvt/scheduler.c
> > index d6b6d0e..64557b9 100644
> > --- a/drivers/gpu/drm/i915/gvt/scheduler.c
> > +++ b/drivers/gpu/drm/i915/gvt/scheduler.c
> > @@ -136,8 +136,30 @@ static int shadow_context_status_change(struct
> notifier_block *nb,
> >  		(struct drm_i915_gem_request *)data;
> >  	struct intel_gvt_workload_scheduler *scheduler =
> >  		&vgpu->gvt->scheduler;
> > -	struct intel_vgpu_workload *workload =
> > -		scheduler->current_workload[req->engine->id];
> > +	struct intel_vgpu_workload *workload;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&scheduler->cur_workload_lock, flags);
> > +	workload = scheduler->current_workload[req->engine->id];
> > +	/* Sometimes a failed workload still has this notification
> > +	 * from HW side.
> > +	 * So at this point, the failed workload may
> > +	 * already been freed. In this case, do nothing. Even the
> > +	 * failed workload is not freed yet, still do nothing.
> > +	 *
> > +	 * Another case, when we goes here, this workload is actually
> > +	 * a new workload already and we cannot do the context schedule
> > +	 * out based on the last failed one.
> > +	 */
> > +	if (unlikely(!workload ||
> 
> My previous point is that if to fix NULL pointer reference oops, just check
> workload as above should be enough. And looking more into this seems like
> we don't align request complete with its workload properly that might lead
> to inconsistent state, e.g the current_workload of scheduler might not
> belong to notified request. And we don't have sane handling of error
> workload, e.g might be already freed but later got notification on it. We
> should manage workload lifecycle better aligned with request fence.

Thanks Zhenyu for the suggestion. Just had a discussion with Zhi, currently the i915 driver has a reply mechanism after a GPU reset which made the gvt workload lifecycle not aligned with request. So to better manage the workload lifecycle with the request fence may also need to investigate from i915 side as well. This will take some time. So for a short term solution to resolve the kernel NULL pointer, we just need to check the workload. Do you agree to use this short term solution to resolve the kernel NULL pointer issue?

> 
> > +		((workload->status && (workload->status != -EINPROGRESS))
> ||
> > +		 vgpu->resetting) ||
> > +		(action == INTEL_CONTEXT_SCHEDULE_OUT &&
> > +		!atomic_read(&workload->shadow_ctx_active)))) {
> > +		spin_unlock_irqrestore(&scheduler->cur_workload_lock,
> flags);
> > +		return NOTIFY_OK;
> > +	}
> > +	spin_unlock_irqrestore(&scheduler->cur_workload_lock, flags);
> >
> >  	switch (action) {
> >  	case INTEL_CONTEXT_SCHEDULE_IN:
> > @@ -353,6 +375,7 @@ static void complete_current_workload(struct
> intel_gvt *gvt, int ring_id)
> >  	struct intel_vgpu_workload *workload;
> >  	struct intel_vgpu *vgpu;
> >  	int event;
> > +	unsigned long flags;
> >
> >  	mutex_lock(&gvt->lock);
> >
> > @@ -373,7 +396,9 @@ static void complete_current_workload(struct
> intel_gvt *gvt, int ring_id)
> >  	gvt_dbg_sched("ring id %d complete workload %p status %d\n",
> >  			ring_id, workload, workload->status);
> >
> > +	spin_lock_irqsave(&scheduler->cur_workload_lock, flags);
> >  	scheduler->current_workload[ring_id] = NULL;
> > +	spin_unlock_irqrestore(&scheduler->cur_workload_lock, flags);
> >
> >  	list_del_init(&workload->list);
> >  	workload->complete(workload);
> > @@ -514,6 +539,7 @@ int intel_gvt_init_workload_scheduler(struct
> > intel_gvt *gvt)
> >
> >  	gvt_dbg_core("init workload scheduler\n");
> >
> > +	spin_lock_init(&scheduler->cur_workload_lock);
> >  	init_waitqueue_head(&scheduler->workload_complete_wq);
> >
> >  	for (i = 0; i < I915_NUM_ENGINES; i++) { diff --git
> > a/drivers/gpu/drm/i915/gvt/scheduler.h
> > b/drivers/gpu/drm/i915/gvt/scheduler.h
> > index 2833dfa..b446ea9 100644
> > --- a/drivers/gpu/drm/i915/gvt/scheduler.h
> > +++ b/drivers/gpu/drm/i915/gvt/scheduler.h
> > @@ -40,6 +40,8 @@ struct intel_gvt_workload_scheduler {
> >  	struct intel_vgpu *current_vgpu;
> >  	struct intel_vgpu *next_vgpu;
> >  	struct intel_vgpu_workload
> *current_workload[I915_NUM_ENGINES];
> > +	/* For current_workload handling sync */
> > +	spinlock_t cur_workload_lock;
> >  	bool need_reschedule;
> >
> >  	wait_queue_head_t workload_complete_wq;
> > --
> > 2.7.4
> >
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 
> --
> Open Source Technology Center, Intel ltd.
> 
> $gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827