[Intel-gfx] [PATCH] kvm/vfio: Fix potential deadlock on vfio group_lock

Tue Jan 31 14:27:54 UTC 2023

I encountered a lockdep splat while running some regression tests today 
(see below). I suspected it might be this patch so I reverted it, 
rebuilt the kernel and ran the regression tests again; this time, the 
test ran cleanly. It looks like this patch may not have fixed the 
problem for which it was intended. Here is the relevant dmesg output:

[  579.471402] hades[1099]: Start test run
[  579.473486] hades[1099]: Start 
'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest' test
[  579.505804] vfio_ap matrix: MDEV: Registered
[  579.604024] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb: Adding 
to iommu group 0

[  585.043898] ======================================================
[  585.043900] WARNING: possible circular locking dependency detected
[  585.043902] 6.2.0-rc6-00057-g41c03ba9beea-dirty #18 Not tainted
[  585.043904] ------------------------------------------------------
[  585.043905] CPU 0/KVM/1173 is trying to acquire lock:
[  585.043907] 000000008cfb24b0 (&group->group_lock){+.+.}-{3:3}, at: 
vfio_file_set_kvm+0x50/0x68 [vfio]
[  585.043919]
                but task is already holding lock:
[  585.043920] 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at: 
kvm_device_release+0x4a/0xb8 [kvm]
[  585.043960]
                which lock already depends on the new lock.

[  585.043962]
                the existing dependency chain (in reverse order) is:
[  585.043963]
                -> #3 (&kvm->lock){+.+.}-{3:3}:
[  585.043967]        __lock_acquire+0x3e2/0x750
[  585.043974]        lock_acquire.part.0+0xe2/0x250
[  585.043977]        lock_acquire+0xac/0x1d0
[  585.043980]        __mutex_lock+0x9e/0x868
[  585.043985]        mutex_lock_nested+0x32/0x40
[  585.043988]        vfio_ap_mdev_open_device+0x9a/0x198 [vfio_ap]
[  585.043991]        vfio_device_open+0x122/0x168 [vfio]
[  585.043995]        vfio_device_open_file+0x64/0x120 [vfio]
[  585.043999]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[  585.044002]        __s390x_sys_ioctl+0xc0/0x100
[  585.044007]        do_syscall+0xee/0x118
[  585.044032]        __do_syscall+0xd2/0x120
[  585.044035]        system_call+0x82/0xb0
[  585.044037]
                -> #2 (&matrix_dev->guests_lock){+.+.}-{3:3}:
[  585.044041]        __lock_acquire+0x3e2/0x750
[  585.044044]        lock_acquire.part.0+0xe2/0x250
[  585.044047]        lock_acquire+0xac/0x1d0
[  585.044049]        __mutex_lock+0x9e/0x868
[  585.044052]        mutex_lock_nested+0x32/0x40
[  585.044054]        vfio_ap_mdev_open_device+0x8c/0x198 [vfio_ap]
[  585.044057]        vfio_device_open+0x122/0x168 [vfio]
[  585.044060]        vfio_device_open_file+0x64/0x120 [vfio]
[  585.044064]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[  585.044068]        __s390x_sys_ioctl+0xc0/0x100
[  585.044070]        do_syscall+0xee/0x118
[  585.044072]        __do_syscall+0xd2/0x120
[  585.044074]        system_call+0x82/0xb0
[  585.044076]
                -> #1 (&new_dev_set->lock){+.+.}-{3:3}:
[  585.044080]        __lock_acquire+0x3e2/0x750
[  585.044082]        lock_acquire.part.0+0xe2/0x250
[  585.044085]        lock_acquire+0xac/0x1d0
[  585.044088]        __mutex_lock+0x9e/0x868
[  585.044090]        mutex_lock_nested+0x32/0x40
[  585.044093]        vfio_device_open+0x3e/0x168 [vfio]
[  585.044096]        vfio_device_open_file+0x64/0x120 [vfio]
[  585.044100]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[  585.044104]        __s390x_sys_ioctl+0xc0/0x100
[  585.044106]        do_syscall+0xee/0x118
[  585.044108]        __do_syscall+0xd2/0x120
[  585.044110]        system_call+0x82/0xb0
[  585.044112]
                -> #0 (&group->group_lock){+.+.}-{3:3}:
[  585.044115]        check_prev_add+0xd4/0xf10
[  585.044118]        validate_chain+0x698/0x8e8
[  585.044120]        __lock_acquire+0x3e2/0x750
[  585.044123]        lock_acquire.part.0+0xe2/0x250
[  585.044125]        lock_acquire+0xac/0x1d0
[  585.044128]        __mutex_lock+0x9e/0x868
[  585.044130]        mutex_lock_nested+0x32/0x40
[  585.044133]        vfio_file_set_kvm+0x50/0x68 [vfio]
[  585.044137]        kvm_vfio_release+0x5e/0xf8 [kvm]
[  585.044156]        kvm_device_release+0x90/0xb8 [kvm]
[  585.044175]        __fput+0xaa/0x2a0
[  585.044180]        task_work_run+0x76/0xd0
[  585.044183]        do_exit+0x248/0x538
[  585.044186]        do_group_exit+0x40/0xb0
[  585.044188]        get_signal+0x614/0x698
[  585.044192]        arch_do_signal_or_restart+0x58/0x370
[  585.044195]        exit_to_user_mode_loop+0xe8/0x1b8
[  585.044200]        exit_to_user_mode_prepare+0x164/0x190
[  585.044203]        __do_syscall+0xd2/0x120
[  585.044205]        system_call+0x82/0xb0
[  585.044207]
                other info that might help us debug this:

[  585.044209] Chain exists of:
                  &group->group_lock --> &matrix_dev->guests_lock --> 
&kvm->lock

[  585.044213]  Possible unsafe locking scenario:

[  585.044214]        CPU0                    CPU1
[  585.044216]        ----                    ----
[  585.044217]   lock(&kvm->lock);
[  585.044219] lock(&matrix_dev->guests_lock);
[  585.044221] lock(&kvm->lock);
[  585.044223]   lock(&group->group_lock);
[  585.044225]
                 *** DEADLOCK ***

[  585.044227] 1 lock held by CPU 0/KVM/1173:
[  585.044228]  #0: 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at: 
kvm_device_release+0x4a/0xb8 [kvm]
[  585.044251]
                stack backtrace:
[  585.044253] CPU: 3 PID: 1173 Comm: CPU 0/KVM Not tainted 
6.2.0-rc6-00057-g41c03ba9beea-dirty #18
[  585.044256] Hardware name: IBM 8561 T01 772 (LPAR)
[  585.044257] Call Trace:
[  585.044258]  [<000000011a818936>] dump_stack_lvl+0x8e/0xc8
[  585.044261]  [<0000000119aca3f2>] check_noncircular+0x132/0x158
[  585.044264]  [<0000000119acba44>] check_prev_add+0xd4/0xf10
[  585.044267]  [<0000000119accf18>] validate_chain+0x698/0x8e8
[  585.044270]  [<0000000119ace70a>] __lock_acquire+0x3e2/0x750
[  585.044273]  [<0000000119acf682>] lock_acquire.part.0+0xe2/0x250
[  585.044276]  [<0000000119acf89c>] lock_acquire+0xac/0x1d0
[  585.044279]  [<000000011a823c66>] __mutex_lock+0x9e/0x868
[  585.044282]  [<000000011a824462>] mutex_lock_nested+0x32/0x40
[  585.044285]  [<000003ff7fbcd6a0>] vfio_file_set_kvm+0x50/0x68 [vfio]
[  585.044289]  [<000003ff7feacab6>] kvm_vfio_release+0x5e/0xf8 [kvm]
[  585.044308]  [<000003ff7fea6d58>] kvm_device_release+0x90/0xb8 [kvm]
[  585.044328]  [<0000000119dbb83a>] __fput+0xaa/0x2a0
[  585.044331]  [<0000000119a67c66>] task_work_run+0x76/0xd0
[  585.044333]  [<0000000119a3ec18>] do_exit+0x248/0x538
[  585.044335]  [<0000000119a3f0c8>] do_group_exit+0x40/0xb0
[  585.044338]  [<0000000119a50dec>] get_signal+0x614/0x698
[  585.044340]  [<00000001199ea030>] arch_do_signal_or_restart+0x58/0x370
[  585.044343]  [<0000000119b0bb50>] exit_to_user_mode_loop+0xe8/0x1b8
[  585.044346]  [<0000000119b0bd84>] exit_to_user_mode_prepare+0x164/0x190
[  585.044349]  [<000000011a818d2a>] __do_syscall+0xd2/0x120
[  585.044351]  [<000000011a82c462>] system_call+0x82/0xb0
[  585.044354] INFO: lockdep is turned off.
[  610.595528] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb: 
Removing from iommu group 0
[  610.604408] vfio_ap matrix: MDEV: Unregistering
[  610.826074] hades[1099]: Stop 
'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest'

On 1/20/23 10:05 AM, Yi Liu wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation.  This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> VFIO holds group->group_lock/group_rwsem
>    -> kvm_put_kvm
>     -> kvm_destroy_vm
>      -> kvm_destroy_devices
>       -> kvm_vfio_destroy
>        -> kvm_vfio_file_set_kvm
>         -> vfio_file_set_kvm
>          -> try to hold group->group_lock/group_rwsem
>
> The key function is the kvm_destroy_devices() which triggers destroy cb
> of kvm_device_ops. It calls back to vfio and try to hold group_lock. So
> if this path doesn't call back to vfio, this dead lock would be fixed.
> Actually, there is a way for it. KVM provides another point to free the
> kvm-vfio device which is the point when the device file descriptor is
> closed. This can be achieved by providing the release cb instead of the
> destroy cb. Also rename kvm_vfio_destroy() to be kvm_vfio_release().
>
> 	/*
> 	 * Destroy is responsible for freeing dev.
> 	 *
> 	 * Destroy may be called before or after destructors are called
> 	 * on emulated I/O regions, depending on whether a reference is
> 	 * held by a vcpu or other kvm component that gets destroyed
> 	 * after the emulated I/O.
> 	 */
> 	void (*destroy)(struct kvm_device *dev);
>
> 	/*
> 	 * Release is an alternative method to free the device. It is
> 	 * called when the device file descriptor is closed. Once
> 	 * release is called, the destroy method will not be called
> 	 * anymore as the device is removed from the device list of
> 	 * the VM. kvm->lock is held.
> 	 */
> 	void (*release)(struct kvm_device *dev);
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson at redhat.com>
> Suggested-by: Kevin Tian <kevin.tian at intel.com>
> Reviewed-by: Jason Gunthorpe <jgg at nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu at intel.com>
> ---
>   virt/kvm/vfio.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 495ceabffe88..e94f3ea718e5 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -336,7 +336,7 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>   	return -ENXIO;
>   }
>   
> -static void kvm_vfio_destroy(struct kvm_device *dev)
> +static void kvm_vfio_release(struct kvm_device *dev)
>   {
>   	struct kvm_vfio *kv = dev->private;
>   	struct kvm_vfio_group *kvg, *tmp;
> @@ -363,7 +363,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type);
>   static struct kvm_device_ops kvm_vfio_ops = {
>   	.name = "kvm-vfio",
>   	.create = kvm_vfio_create,
> -	.destroy = kvm_vfio_destroy,
> +	.release = kvm_vfio_release,
>   	.set_attr = kvm_vfio_set_attr,
>   	.has_attr = kvm_vfio_has_attr,
>   };