[PATCH] i915/drm/gvt: initialize CSB tail value with zero
Zhenyu Wang
zhenyuw at linux.intel.com
Thu Sep 6 03:06:15 UTC 2018
On 2018.09.06 09:06:02 +0800, Xinyun Liu wrote:
> Thanks for Zhenyu's review. The hang happens in the guest OS.
>
> The root cause appears to be that GVT resets the CSB pointer too early, so
> the guest OS processes the request with an invalid CSB value and ends up in
> an inconsistent state.
>
Good catch.
> The following patch is for Acrn-kernel (4.14). I checked the GVT staging
> branch: it does not have the issue and does not need my fix, which happens to
> use the same logic as:
>
> e2c43c0111d5 - drm/i915/gvt: Move clean_workloads() into scheduler.c <Zhi Wang>
>
Fine. As 4.14 is an LTS kernel, if you would like to backport this to the upstream
stable kernel, you may send the backported patch to the stable list, reference the
mainline commit, and cc us for ack.
> Thanks,
> Xinyun
>
> From c3b3b0f8f7068e2cfcf66f1ac3a451f254dc9afe Mon Sep 17 00:00:00 2001
> From: Xinyun Liu <xinyun.liu at intel.com>
> Date: Wed, 5 Sep 2018 20:54:27 +0800
> Subject: [PATCH v2] drm/i915/gvt: only kick out invalid request instead of
> reset execlist
>
> When GVT finds that a request from DomU causes a GPU hang, it needs to kick
> out this bad request and trigger DomU to do a GPU reset. The execlist CSB
> information should be kept so that DomU can process the request itself.
> After DomU has cleaned up the request and triggered a GPU engine HW
> reset, GVT will reset the vGPU, which includes the execlist.
>
> Tracked-On: projectacrn/acrn-hypervisor#1130
> Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
> ---
> drivers/gpu/drm/i915/gvt/execlist.c | 1 -
> drivers/gpu/drm/i915/gvt/execlist.h | 1 +
> drivers/gpu/drm/i915/gvt/scheduler.c | 2 +-
> 3 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
> index 9d74080f32f1..b182f037a67f 100644
> --- a/drivers/gpu/drm/i915/gvt/execlist.c
> +++ b/drivers/gpu/drm/i915/gvt/execlist.c
> @@ -47,7 +47,6 @@
> ((a)->lrca == (b)->lrca))
>
> bool gvt_shadow_wa_ctx = false;
> -static void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
>
> static int context_switch_events[] = {
> [RCS] = RCS_AS_CONTEXT_SWITCH,
> diff --git a/drivers/gpu/drm/i915/gvt/execlist.h b/drivers/gpu/drm/i915/gvt/execlist.h
> index d2348b419303..f20284f81dd1 100644
> --- a/drivers/gpu/drm/i915/gvt/execlist.h
> +++ b/drivers/gpu/drm/i915/gvt/execlist.h
> @@ -182,4 +182,5 @@ int intel_vgpu_submit_execlist(struct intel_vgpu *vgpu, int ring_id);
> void intel_vgpu_reset_execlist(struct intel_vgpu *vgpu,
> unsigned long engine_mask);
>
> +void clean_workloads(struct intel_vgpu *vgpu, unsigned long engine_mask);
> #endif /*_GVT_EXECLIST_H_*/
> diff --git a/drivers/gpu/drm/i915/gvt/scheduler.c b/drivers/gpu/drm/i915/gvt/scheduler.c
> index be910c951e71..d51b878b7fff 100644
> --- a/drivers/gpu/drm/i915/gvt/scheduler.c
> +++ b/drivers/gpu/drm/i915/gvt/scheduler.c
> @@ -705,7 +705,7 @@ static void complete_current_workload(struct intel_gvt *gvt, int ring_id)
> mutex_lock(&gvt->lock);
> list_del_init(&workload->list);
> if (workload->status == -EIO)
> - intel_vgpu_reset_execlist(vgpu, 1 << ring_id);
> + clean_workloads(vgpu, 1 << ring_id);
>
> workload->complete(workload);
>
> --
> 2.18.0
>
>
> On Fri, Aug 31, 2018 at 05:30:33PM +0800, Zhenyu Wang wrote:
> > On 2018.08.31 16:14:52 +0800, Xinyun Liu wrote:
> > > When running `./drv_hangman --run-subtest hangcheck-unterminated` with
> > > AcrnGT, vGPU reset falls into a dead loop because the original CSB tail
> > > value (0xF) was not updated correctly. In fact, the value should be zero
> > > after a GPU reset caused by an invalid context. This dead loop also causes
> > > a kernel panic if some graphics workload is running on the vGPU.
> > >
> >
> > Is this a guest kernel panic or a host one?
> >
> > > BUG: unable to handle kernel paging request at 00000000fffffffc
> > > IP: process_csb+0x14a/0x2a0
> > > PGD 0 P4D 0
> > > Oops: 0002 [#1] PREEMPT SMP
> > > Modules linked in: dwc3_pci dwc3 snd_usb_audio xhci_pci mei_me xhci_hcd snd_usbmidi_lib mei snd_hwdep hci_uart bluetooth ecdh_generic rfkill_gpio trusty_timer trusty_wall trusty_b
> > > CPU: 0 PID: 1371 Comm: kworker/0:1H Tainted: P U W O 4.14.61-quilt-2e5dc0ac-g0feae7d57171 #2
> > > Hardware name: ACRN-DM, BIOS 1.00 03/14/2014
> > > Workqueue: events_highpri i915_error_reset
> > > task: ffff88007cbc0040 task.stack: ffffc900010b0000
> > > RIP: 0010:process_csb+0x14a/0x2a0
> > > RSP: 0018:ffffc900010b3c90 EFLAGS: 00010206
> > > RAX: 00000000fffffffc RBX: ffffc90001e02370 RCX: 0000000000000008
> > > RDX: 0000000000000009 RSI: ffff88007c830308 RDI: 0000000000000000
> > > RBP: ffffc900010b3cd8 R08: 0000000000000001 R09: 0000000000002370
> > > R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007c758000
> > > R13: 0000000000000007 R14: 0000000000000004 R15: ffff88007c830000
> > > FS: 0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00000000fffffffc CR3: 00000000796e0000 CR4: 00000000003406f0
> > > Call Trace:
> > > ? wake_up_process+0x20/0x20
> > > execlists_reset_prepare+0x65/0x120
> > > i915_gem_reset_prepare_engine+0x28/0x40
> > > i915_reset_engine+0x1e/0xe0
> > > i915_handle_error+0x117/0x470
> > > ? cpuacct_charge+0x81/0x90
> > > ? _raw_spin_unlock_irq+0x1e/0x40
> > > ? finish_task_switch+0x8d/0x1f0
> > > i915_error_reset+0x32/0x40
> > > process_one_work+0x186/0x3e0
> > > worker_thread+0x3d/0x3b0
> > > kthread+0x132/0x150
> > > ? process_one_work+0x3e0/0x3e0
> > > ? kthread_create_on_node+0x70/0x70
> > > ret_from_fork+0x3a/0x50
> > > Code: 00 00 44 89 00 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 89 d8 31 d2 45 31 f6 e9 57 ff ff ff 0f 1f 44 00 00 48 85 c0 74 13 <f0> ff 08 0f 88 ed 6f 5d 00 75 08 48 89 c7
> > > RIP: process_csb+0x14a/0x2a0 RSP: ffffc900010b3c90
> > > CR2: 00000000fffffffc
> > > ---[ end trace 5751fb1d7b00b459 ]---
> > >
> > > Link: https://lists.projectacrn.org/g/acrn-dev/message/11136
> > > Signed-off-by: Xinyun Liu <xinyun.liu at intel.com>
> > > ---
> > > drivers/gpu/drm/i915/gvt/execlist.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gvt/execlist.c b/drivers/gpu/drm/i915/gvt/execlist.c
> > > index 70494e394d2c..768e0b467a11 100644
> > > --- a/drivers/gpu/drm/i915/gvt/execlist.c
> > > +++ b/drivers/gpu/drm/i915/gvt/execlist.c
> > > @@ -523,7 +523,7 @@ static void init_vgpu_execlist(struct intel_vgpu *vgpu, int ring_id)
> > > _EL_OFFSET_STATUS_PTR);
> > > ctx_status_ptr.dw = vgpu_vreg(vgpu, ctx_status_ptr_reg);
> > > ctx_status_ptr.read_ptr = 0;
> > > - ctx_status_ptr.write_ptr = 0x7;
> > > + ctx_status_ptr.write_ptr = 0;
> > > vgpu_vreg(vgpu, ctx_status_ptr_reg) = ctx_status_ptr.dw;
> > > }
> >
> > I think we do follow the HW definition for the initial execlist status register
> > values. I haven't double-checked against the spec; that's just from memory. And
> > after vGPU reset, it should be back to the initial state. Is there any wrong
> > assumption in the reset handling? Or maybe you could find the real reason for the panic?
> >
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Open Source Technology Center, Intel ltd.
$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827