[PATCH] i915/drm/gvt: initialize CSB tail value with zero

Thu Sep 6 03:06:15 UTC 2018

On 2018.09.06 09:06:02 +0800, Xinyun Liu wrote:
> Thanks for Zhenyu's review. The hang happens in the guest OS.
> 
> The root cause should be that GVT resets the CSB pointer a little too early, so
> the guest OS processes the request with an invalid CSB value and falls into chaos.
>

Good catch.
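
For anyone following the thread, here is a minimal sketch of why a stale
or out-of-range CSB write pointer can turn a consumer loop into a dead
loop. This is a simplified model, not the i915 or GVT code: the helper
name csb_steps_to_tail(), the max_steps guard, and the ring size of 6
entries are all assumptions made for illustration.

```c
#include <assert.h>

/* Assumed ring size for this sketch only. */
#define CSB_ENTRIES 6

/*
 * Hypothetical helper: advance a wrapping read pointer (head) until it
 * meets the write pointer (tail). Returns the number of entries consumed,
 * or -1 if tail is still not reached after max_steps advances.
 */
static int csb_steps_to_tail(unsigned int head, unsigned int tail,
			     int max_steps)
{
	int steps = 0;

	/* The read pointer is expected to stay inside the ring. */
	assert(head < CSB_ENTRIES);

	while (head != tail) {
		if (steps == max_steps)
			return -1;	/* tail is unreachable: would spin forever */
		if (++head == CSB_ENTRIES)
			head = 0;	/* wrap around the ring */
		steps++;
	}
	return steps;
}
```

In this model a tail of 0 with head 0 consumes nothing, while a tail of
0x7 can never be matched by a head that wraps at 6, so an unguarded loop
would not terminate. Whether the guest hits exactly this path depends on
the kernel version in use.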

> The following patch is for the ACRN kernel (4.14). I checked the GVT staging
> branch; it doesn't have the issue and doesn't need my fix, which happens to
> have the same logic as:
> 
> e2c43c0111d5 - drm/i915/gvt: Move clean_workloads() into scheduler.c <Zhi Wang>
>

Fine. As 4.14 is an LTS kernel, if you would like to backport this to the upstream
stable kernel, you may send the backported patch to the stable list, reference the
mainline commit, and cc us for an ack.

> Thanks,
> Xinyun
> 
> From c3b3b0f8f7068e2cfcf66f1ac3a451f254dc9afe Mon Sep 17 00:00:00 2001
> From: Xinyun Liu <xinyun.liu at intel.com>
> Date: Wed, 5 Sep 2018 20:54:27 +0800
> Subject: [PATCH v2] drm/i915/gvt: only kick out invalid request instead of
>  reset execlist
> 
> When GVT finds that a request from DomU causes a GPU hang, it needs to kick
> out this bad request and trigger DomU to do a GPU reset. The execlist CSB
> information should be kept so that DomU can process the request
> itself. After DomU has cleaned up the request and triggered a GPU engine HW
> reset, GVT will reset the vGPU, which includes the execlist.
> 
> Tracked-On: projectacrn/acrn-hypervisor#1130
> Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
> ---
>  drivers/gpu/drm/i915/gvt/execlist.c  | 1 -
>  drivers/gpu/drm/i915/gvt/execlist.h  | 1 +
>  drivers/gpu/drm/i915/gvt/scheduler.c | 2 +-
>  3 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
> index 9d74080f32f1..b182f037a67f 100644
> --- a/drivers/gpu/drm/i915/gvt/execlist.c
> +++ b/drivers/gpu/drm/i915/gvt/execlist.c
> @@ -47,7 +47,6 @@
>  		((a)->lrca == (b)->lrca))
>  
>  bool gvt_shadow_wa_ctx = false;
> -static void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
>  
>  static int context_switch_events[] = {
>  	[RCS] = RCS_AS_CONTEXT_SWITCH,
> diff --git a/drivers/gpu/drm/i915/gvt/execlist.h b/drivers/gpu/drm/i915/gvt/execlist.h
> index d2348b419303..f20284f81dd1 100644
> --- a/drivers/gpu/drm/i915/gvt/execlist.h
> +++ b/drivers/gpu/drm/i915/gvt/execlist.h
> @@ -182,4 +182,5 @@ int intel_vgpu_submit_execlist(struct intel_vgpu *vgpu, int ring_id);
>  void intel_vgpu_reset_execlist(struct intel_vgpu *vgpu,
>  		unsigned long engine_mask);
>  
> +void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
>  #endif /*_GVT_EXECLIST_H_*/
> diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
> index be910c951e71..d51b878b7fff 100644
> --- a/drivers/gpu/drm/i915/gvt/scheduler.c
> +++ b/drivers/gpu/drm/i915/gvt/scheduler.c
> @@ -705,7 +705,7 @@ static void complete_current_workload(struct intel_gvt *gvt, int ring_id)
>  	mutex_lock(&gvt->lock);
>  	list_del_init(&workload->list);
>  	if (workload->status == -EIO)
> -		intel_vgpu_reset_execlist(vgpu, 1 << ring_id);
> +		clean_workloads(vgpu, 1 << ring_id);
>  
>  	workload->complete(workload);
>  
> -- 
> 2.18.0
> 
> 
> On Fri, Aug 31, 2018 at 05:30:33PM +0800, Zhenyu Wang wrote:
> > On 2018.08.31 16:14:52 +0800, Xinyun Liu wrote:
> > > When running `./drv_hangman --run-subtest hangcheck-unterminated` with
> > > AcrnGT, vGPU reset falls into a dead loop because the original CSB tail
> > > value (0xF) was not updated correctly. In fact, the value should be zero
> > > after a GPU reset caused by an invalid context. This dead loop also causes
> > > a kernel panic if there is some graphics workload running on the vGPU.
> > >
> > 
> > Is this guest kernel panic or host?
> > 
> > >  BUG: unable to handle kernel paging request at 00000000fffffffc
> > >  IP: process_csb+0x14a/0x2a0
> > >  PGD 0 P4D 0
> > >  Oops: 0002 [#1] PREEMPT SMP
> > >  Modules linked in: dwc3_pci dwc3 snd_usb_audio xhci_pci mei_me xhci_hcd snd_usbmidi_lib mei snd_hwdep hci_uart bluetooth ecdh_generic rfkill_gpio trusty_timer trusty_wall trusty_b
> > >  CPU: 0 PID: 1371 Comm: kworker/0:1H Tainted: P     U  W  O    4.14.61-quilt-2e5dc0ac-g0feae7d57171 #2
> > >  Hardware name:  ACRN-DM, BIOS 1.00 03/14/2014
> > >  Workqueue: events_highpri i915_error_reset
> > >  task: ffff88007cbc0040 task.stack: ffffc900010b0000
> > >  RIP: 0010:process_csb+0x14a/0x2a0
> > >  RSP: 0018:ffffc900010b3c90 EFLAGS: 00010206
> > >  RAX: 00000000fffffffc RBX: ffffc90001e02370 RCX: 0000000000000008
> > >  RDX: 0000000000000009 RSI: ffff88007c830308 RDI: 0000000000000000
> > >  RBP: ffffc900010b3cd8 R08: 0000000000000001 R09: 0000000000002370
> > >  R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007c758000
> > >  R13: 0000000000000007 R14: 0000000000000004 R15: ffff88007c830000
> > >  FS:  0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 00000000fffffffc CR3: 00000000796e0000 CR4: 00000000003406f0
> > >  Call Trace:
> > >   ? wake_up_process+0x20/0x20
> > >   execlists_reset_prepare+0x65/0x120
> > >   i915_gem_reset_prepare_engine+0x28/0x40
> > >   i915_reset_engine+0x1e/0xe0
> > >   i915_handle_error+0x117/0x470
> > >   ? cpuacct_charge+0x81/0x90
> > >   ? _raw_spin_unlock_irq+0x1e/0x40
> > >   ? finish_task_switch+0x8d/0x1f0
> > >   i915_error_reset+0x32/0x40
> > >   process_one_work+0x186/0x3e0
> > >   worker_thread+0x3d/0x3b0
> > >   kthread+0x132/0x150
> > >   ? process_one_work+0x3e0/0x3e0
> > >   ? kthread_create_on_node+0x70/0x70
> > >   ret_from_fork+0x3a/0x50
> > >  Code: 00 00 44 89 00 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 89 d8 31 d2 45 31 f6 e9 57 ff ff ff 0f 1f 44 00 00 48 85 c0 74 13 <f0> ff 08 0f 88 ed 6f 5d 00 75 08 48 89 c7
> > >  RIP: process_csb+0x14a/0x2a0 RSP: ffffc900010b3c90
> > >  CR2: 00000000fffffffc
> > >  ---[ end trace 5751fb1d7b00b459 ]---
> > > 
> > > Link: https://lists.projectacrn.org/g/acrn-dev/message/11136
> > > Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gvt/execlist.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
> > > index 70494e394d2c..768e0b467a11 100644
> > > --- a/drivers/gpu/drm/i915/gvt/execlist.c
> > > +++ b/drivers/gpu/drm/i915/gvt/execlist.c
> > > @@ -523,7 +523,7 @@ static void init_vgpu_execlist(struct intel_vgpu *vgpu, int ring_id)
> > >  			_EL_OFFSET_STATUS_PTR);
> > >  	ctx_status_ptr.dw = vgpu_vreg(vgpu, ctx_status_ptr_reg);
> > >  	ctx_status_ptr.read_ptr = 0;
> > > -	ctx_status_ptr.write_ptr = 0x7;
> > > +	ctx_status_ptr.write_ptr = 0;
> > >  	vgpu_vreg(vgpu, ctx_status_ptr_reg) = ctx_status_ptr.dw;
> > >  }
> > 
> > I think we do follow the HW definition for the initial execlist status reg values.
> > I haven't double-checked with the spec, but that's just my memory. And after
> > vgpu reset, it should be back to the initial state. Is there any wrong assumption
> > during reset handling? Or maybe you could find the real reason for the panic?
> > 
> > -- 
> > Open Source Technology Center, Intel ltd.
> > 
> > $gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827
> 
> 
> 
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 

-- 
Open Source Technology Center, Intel ltd.

$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827