[PATCH] i915/drm/gvt: initialize CSB tail value with zero

Thu Sep 6 01:06:02 UTC 2018

Thanks for Zhenyu's review. The hang happens in the guest OS.

The root cause appears to be that GVT resets the CSB pointer too early, so the
guest OS processes the request with an invalid CSB value and falls into chaos.

The following patch is for Acrn-kernel (4.14). I checked the GVT staging
branch: it doesn't have the issue and doesn't need my fix, which happens to
have the same logic as:

e2c43c0111d5 - drm/i915/gvt: Move clean_workloads() into scheduler.c <Zhi Wang>

Thanks,
Xinyun

From c3b3b0f8f7068e2cfcf66f1ac3a451f254dc9afe Mon Sep 17 00:00:00 2001
From: Xinyun Liu <xinyun.liu at intel.com>
Date: Wed, 5 Sep 2018 20:54:27 +0800
Subject: [PATCH v2] drm/i915/gvt: only kick out invalid request instead of
 reset execlist

When GVT finds that a request from DomU causes a GPU hang, it needs to
kick out this bad request and trigger DomU to do a GPU reset. The
execlist CSB information should be kept so that DomU can process the
request itself. After DomU has cleaned up the request and triggered the
GPU engine HW reset, GVT will reset the vGPU, which includes the
execlist.

Tracked-On: projectacrn/acrn-hypervisor#1130
Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
---
 drivers/gpu/drm/i915/gvt/execlist.c  | 1 -
 drivers/gpu/drm/i915/gvt/execlist.h  | 1 +
 drivers/gpu/drm/i915/gvt/scheduler.c | 2 +-
 3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
index 9d74080f32f1..b182f037a67f 100644
--- a/drivers/gpu/drm/i915/gvt/execlist.c
+++ b/drivers/gpu/drm/i915/gvt/execlist.c
@@ -47,7 +47,6 @@
 		((a)->lrca == (b)->lrca))
 
 bool gvt_shadow_wa_ctx = false;
-static void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
 
 static int context_switch_events[] = {
 	[RCS] = RCS_AS_CONTEXT_SWITCH,
diff --git a/drivers/gpu/drm/i915/gvt/execlist.h b/drivers/gpu/drm/i915/gvt/execlist.h
index d2348b419303..f20284f81dd1 100644
--- a/drivers/gpu/drm/i915/gvt/execlist.h
+++ b/drivers/gpu/drm/i915/gvt/execlist.h
@@ -182,4 +182,5 @@ int intel_vgpu_submit_execlist(struct intel_vgpu *vgpu, int ring_id);
 void intel_vgpu_reset_execlist(struct intel_vgpu *vgpu,
 		unsigned long engine_mask);
 
+void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
 #endif /*_GVT_EXECLIST_H_*/
diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
index be910c951e71..d51b878b7fff 100644
--- a/drivers/gpu/drm/i915/gvt/scheduler.c
+++ b/drivers/gpu/drm/i915/gvt/scheduler.c
@@ -705,7 +705,7 @@ static void complete_current_workload(struct intel_gvt *gvt, int ring_id)
 	mutex_lock(&gvt->lock);
 	list_del_init(&workload->list);
 	if (workload->status == -EIO)
-		intel_vgpu_reset_execlist(vgpu, 1 << ring_id);
+		clean_workloads(vgpu, 1 << ring_id);
 
 	workload->complete(workload);
 
-- 
2.18.0


On Fri, Aug 31, 2018 at 05:30:33PM +0800, Zhenyu Wang wrote:
> On 2018.08.31 16:14:52 +0800, Xinyun Liu wrote:
> > When running `./drv_hangman --run-subtest hangcheck-unterminated` with
> > AcrnGT, vGPU reset falls into a dead loop because the original CSB tail
> > value (0xF) was not updated correctly. In fact, the value should be zero
> > after a GPU reset caused by an invalid context. This dead loop also causes
> > a kernel panic if there is some graphics workload running on the vGPU.
> >
> 
> Is this guest kernel panic or host?
> 
> >  BUG: unable to handle kernel paging request at 00000000fffffffc
> >  IP: process_csb+0x14a/0x2a0
> >  PGD 0 P4D 0
> >  Oops: 0002 [#1] PREEMPT SMP
> >  Modules linked in: dwc3_pci dwc3 snd_usb_audio xhci_pci mei_me xhci_hcd snd_usbmidi_lib mei snd_hwdep hci_uart bluetooth ecdh_generic rfkill_gpio trusty_timer trusty_wall trusty_b
> >  CPU: 0 PID: 1371 Comm: kworker/0:1H Tainted: P     U  W  O    4.14.61-quilt-2e5dc0ac-g0feae7d57171 #2
> >  Hardware name:  ACRN-DM, BIOS 1.00 03/14/2014
> >  Workqueue: events_highpri i915_error_reset
> >  task: ffff88007cbc0040 task.stack: ffffc900010b0000
> >  RIP: 0010:process_csb+0x14a/0x2a0
> >  RSP: 0018:ffffc900010b3c90 EFLAGS: 00010206
> >  RAX: 00000000fffffffc RBX: ffffc90001e02370 RCX: 0000000000000008
> >  RDX: 0000000000000009 RSI: ffff88007c830308 RDI: 0000000000000000
> >  RBP: ffffc900010b3cd8 R08: 0000000000000001 R09: 0000000000002370
> >  R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007c758000
> >  R13: 0000000000000007 R14: 0000000000000004 R15: ffff88007c830000
> >  FS:  0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 00000000fffffffc CR3: 00000000796e0000 CR4: 00000000003406f0
> >  Call Trace:
> >   ? wake_up_process+0x20/0x20
> >   execlists_reset_prepare+0x65/0x120
> >   i915_gem_reset_prepare_engine+0x28/0x40
> >   i915_reset_engine+0x1e/0xe0
> >   i915_handle_error+0x117/0x470
> >   ? cpuacct_charge+0x81/0x90
> >   ? _raw_spin_unlock_irq+0x1e/0x40
> >   ? finish_task_switch+0x8d/0x1f0
> >   i915_error_reset+0x32/0x40
> >   process_one_work+0x186/0x3e0
> >   worker_thread+0x3d/0x3b0
> >   kthread+0x132/0x150
> >   ? process_one_work+0x3e0/0x3e0
> >   ? kthread_create_on_node+0x70/0x70
> >   ret_from_fork+0x3a/0x50
> >  Code: 00 00 44 89 00 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 89 d8 31 d2 45 31 f6 e9 57 ff ff ff 0f 1f 44 00 00 48 85 c0 74 13 <f0> ff 08 0f 88 ed 6f 5d 00 75 08 48 89 c7
> >  RIP: process_csb+0x14a/0x2a0 RSP: ffffc900010b3c90
> >  CR2: 00000000fffffffc
> >  ---[ end trace 5751fb1d7b00b459 ]---
> > 
> > Link: https://lists.projectacrn.org/g/acrn-dev/message/11136
> > Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
> > ---
> >  drivers/gpu/drm/i915/gvt/execlist.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
> > index 70494e394d2c..768e0b467a11 100644
> > --- a/drivers/gpu/drm/i915/gvt/execlist.c
> > +++ b/drivers/gpu/drm/i915/gvt/execlist.c
> > @@ -523,7 +523,7 @@ static void init_vgpu_execlist(struct intel_vgpu *vgpu, int ring_id)
> >  			_EL_OFFSET_STATUS_PTR);
> >  	ctx_status_ptr.dw = vgpu_vreg(vgpu, ctx_status_ptr_reg);
> >  	ctx_status_ptr.read_ptr = 0;
> > -	ctx_status_ptr.write_ptr = 0x7;
> > +	ctx_status_ptr.write_ptr = 0;
> >  	vgpu_vreg(vgpu, ctx_status_ptr_reg) = ctx_status_ptr.dw;
> >  }
> 
> I think we follow the HW definition for the initial execlist status register
> values. I haven't double-checked the spec; that's just from memory. After a
> vGPU reset, it should be back to the initial state. Is there a wrong
> assumption in the reset handling? Or maybe you could find the real reason for
> the panic?
> 
> -- 
> Open Source Technology Center, Intel ltd.
> 
> $gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827


