[Intel-gfx] [Bug 216388] New: On Host, kernel errors in KVM, on guests, it shows CPU stalls
Zhenyu Wang
zhenyuw at linux.intel.com
Mon Aug 22 23:21:40 UTC 2022
On 2022.08.22 17:50:33 +0000, Sean Christopherson wrote:
> +GVT folks
>
> On Sun, Aug 21, 2022, bugzilla-daemon at kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216388
> >
> > Bug ID: 216388
> > Summary: On Host, kernel errors in KVM, on guests, it shows CPU stalls
> > Product: Virtualization
> > Version: unspecified
> > Kernel Version: 5.19.0 / 5.19.1 / 5.19.2
> > Hardware: All
> > OS: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: high
> > Priority: P1
> > Component: kvm
> > Assignee: virtualization_kvm at kernel-bugs.osdl.org
> > Reporter: nanook at eskimo.com
> > Regression: No
> >
> > Created attachment 301614
> > --> https://bugzilla.kernel.org/attachment.cgi?id=301614&action=edit
> > The configuration file used to compile this kernel.
> >
> > This behavior has persisted across 5.19.0, 5.19.1, and 5.19.2. While the
> > kernel I am taking this example from is tainted (owing to using Intel
> > development drivers for GPU virtualization), it is also occurring on
> > non-tainted kernels on servers with no development or third party modules
> > installed.
> >
> > INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds.
> > [207177.050049] Tainted: G U I 5.19.2 #1
> > [207177.050050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [207177.050051] task:CPU 2/KVM state:D stack: 0 pid: 2343 ppid: 1 flags:0x00000002
> > [207177.050054] Call Trace:
> > [207177.050055] <TASK>
> > [207177.050056] __schedule+0x359/0x1400
> > [207177.050060] ? kvm_mmu_page_fault+0x1ee/0x980
> > [207177.050062] ? kvm_set_msr_common+0x31f/0x1060
> > [207177.050065] schedule+0x5f/0x100
> > [207177.050066] schedule_preempt_disabled+0x15/0x30
> > [207177.050068] __mutex_lock.constprop.0+0x4e2/0x750
> > [207177.050070] ? aa_file_perm+0x124/0x4f0
> > [207177.050071] __mutex_lock_slowpath+0x13/0x20
> > [207177.050072] mutex_lock+0x25/0x30
> > [207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
>
> This isn't a KVM problem, it's a KVMGT problem (despite the name, KVMGT is very
> much not KVM).
>
> > [207177.050084] intel_vgpu_rw+0xb8/0x1c0 [kvmgt]
> > [207177.050091] intel_vgpu_read+0x20d/0x250 [kvmgt]
> > [207177.050097] vfio_device_fops_read+0x1f/0x40
> > [207177.050100] vfs_read+0x9b/0x160
> > [207177.050102] __x64_sys_pread64+0x93/0xd0
> > [207177.050104] do_syscall_64+0x58/0x80
> > [207177.050106] ? kvm_on_user_return+0x84/0xe0
> > [207177.050107] ? fire_user_return_notifiers+0x37/0x70
> > [207177.050109] ? exit_to_user_mode_prepare+0x41/0x200
> > [207177.050111] ? syscall_exit_to_user_mode+0x1b/0x40
> > [207177.050112] ? do_syscall_64+0x67/0x80
> > [207177.050114] ? irqentry_exit+0x54/0x70
> > [207177.050115] ? sysvec_call_function_single+0x4b/0xa0
> > [207177.050116] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [207177.050118] RIP: 0033:0x7ff51131293f
> > [207177.050119] RSP: 002b:00007ff4ddffa260 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> > [207177.050121] RAX: ffffffffffffffda RBX: 00005599a6835420 RCX: 00007ff51131293f
> > [207177.050122] RDX: 0000000000000004 RSI: 00007ff4ddffa2a8 RDI: 0000000000000027
> > [207177.050123] RBP: 0000000000000004 R08: 0000000000000000 R09: 00000000ffffffff
> > [207177.050124] R10: 0000000000065f10 R11: 0000000000000293 R12: 0000000000065f10
> > [207177.050124] R13: 00005599a6835330 R14: 0000000000000004 R15: 0000000000065f10
> > [207177.050126] </TASK>
> >
> > I am seeing this on Intel i7-6700k, i7-6850k, and i7-9700k platforms.

One recent regression fix for Comet Lake is https://patchwork.freedesktop.org/patch/496987/.
It is on its way to 6.0-rc and will be pushed to 5.19 stable as well. But it looks like
this report affects more platforms? We'll double check.

Thanks

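For anyone comparing further hung-task reports against this one, the point made
above (the task is blocked in kvmgt, not in KVM proper) can be read off the trace
mechanically: skip the frames marked "?", which are unreliable, and look for the
first frame carrying a "[module]" tag. Below is a minimal sketch of that check.
The function names and the regular expression are illustrative assumptions based
only on the trace format quoted in this report, not part of any kernel tooling.

```python
import re

# One call-trace frame as it appears in this report, e.g.
#   [207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
# A leading "?" marks an unreliable frame; "[module]" is absent for built-in code.
# (Assumed format: derived from the quoted trace only.)
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) for every frame not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def first_module_frame(trace_text):
    """Return the innermost reliable frame that belongs to a loadable module."""
    for symbol, module in reliable_frames(trace_text):
        if module:
            return symbol, module
    return None


# Excerpt copied from the trace in this report (mail quoting left in place).
SAMPLE_TRACE = """\
> > [207177.050054] Call Trace:
> > [207177.050056] __schedule+0x359/0x1400
> > [207177.050060] ? kvm_mmu_page_fault+0x1ee/0x980
> > [207177.050065] schedule+0x5f/0x100
> > [207177.050068] __mutex_lock.constprop.0+0x4e2/0x750
> > [207177.050072] mutex_lock+0x25/0x30
> > [207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
> > [207177.050097] vfio_device_fops_read+0x1f/0x40
"""

if __name__ == "__main__":
    print(first_module_frame(SAMPLE_TRACE))
```

Run against the excerpt, this reports the kvmgt MMIO read handler as the first
module frame above the mutex_lock() call, which matches the reading given
earlier in the thread.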
> >
> > This did not happen on 5.17 kernels, and 5.18 kernels never ran stable
> > enough on my platforms to actually run them for more than a few minutes.
> >
> > Likewise 6.0-rc1 has not been stable enough to run in production. After
> > less than three hours running on my workstation it locked hard with even the
> > magic sys-request key being unresponsive and only power cycling the machine got
> > it back.
> >
> > The operating system in use for the host on all machines is Ubuntu 22.04.
> >
> > Guests vary with Ubuntu 22.04 being the most common but also Mint, Debian,
> > Manjaro, Centos, Fedora, ScientificLinux, Zorin, and Windows being in use.
> >
> > I see the same issue manifest on platforms running only Ubuntu guests as
> > with guests of varying operating systems.
> >
> > The configuration file I used to compile this kernel is attached. I
> > compiled it with gcc 12.1.0.
> >
> > This behavior does not manifest itself instantly; typically the machine
> > needs to be running 3-7 days before it does. Once it does, guests keep stalling
> > and restarting libvirtd does not help. The only thing that seems to help is a
> > hard reboot of the physical host. For this reason I believe the issue lies
> > strictly with the host and not the guests.
> >
> > I have listed it as a severity of high since it is completely service
> > interrupting.
> >
> > --
> > You may reply to this email to add a comment.
> >
> > You are receiving this mail because:
> > You are watching the assignee of the bug.