[Bug 100099] New: [BAT] [KBL] nvme driver is using a mutex inside an rcu callback causing igt@gem_exec_suspend@basic-s4-devices to fail with dmesg-warn

bugzilla-daemon at freedesktop.org
Tue Mar 7 12:02:21 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=100099

            Bug ID: 100099
           Summary: [BAT] [KBL] nvme driver is using a mutex inside an rcu
                    callback causing igt@gem_exec_suspend@basic-s4-devices
                    to fail with dmesg-warn
           Product: DRI
           Version: DRI git
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: critical
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: jari.tahvanainen at intel.com
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

Starting from CI_DRM_2252, igt@gem_exec_suspend@basic-s4-devices has had
dmesg-warn as its end state.
See
https://intel-gfx-ci.01.org/CI/CI_DRM_2296/fi-kbl-7500u/igt@gem_exec_suspend@basic-s4-devices.html
for the latest output and links to the dmesg files. Some of the data is copied below:

Command:
/opt/igt/tests/gem_exec_suspend --run-subtest basic-S4-devices
Dmesg:
[  245.192104] Suspending console(s) (use no_console_suspend to debug)

[  245.294768] ======================================================
[  245.294769] [ INFO: possible circular locking dependency detected ]
[  245.294771] 4.11.0-rc1-CI-CI_DRM_2296+ #1 Not tainted
[  245.294772] -------------------------------------------------------
[  245.294773] kworker/u8:1/72 is trying to acquire lock:
[  245.294774]  (rcu_preempt_state.barrier_mutex){+.+...}, at:
[<ffffffff81100731>] _rcu_barrier+0x31/0x160
[  245.294782] 
               but task is already holding lock:
[  245.294783]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa00bbc48>]
i915_gem_freeze+0x18/0x30 [i915]
[  245.294827] 
               which lock already depends on the new lock.

[  245.294828] 
               the existing dependency chain (in reverse order) is:
[  245.294829] 
               -> #5 (&dev->struct_mutex){+.+.+.}:
[  245.294836]        lock_acquire+0xc9/0x220
[  245.294842]        drm_gem_object_put_unlocked+0x40/0xc0
[  245.294845]        drm_gem_mmap+0xd1/0x100
[  245.294849]        mmap_region+0x39e/0x600
[  245.294851]        do_mmap+0x44c/0x500
[  245.294853]        vm_mmap_pgoff+0x9d/0xd0
[  245.294856]        SyS_mmap_pgoff+0x182/0x220
[  245.294858]        SyS_mmap+0x16/0x20
[  245.294862]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  245.294863] 
               -> #4 (&mm->mmap_sem){++++++}:
[  245.294868]        lock_acquire+0xc9/0x220
[  245.294871]        __might_fault+0x6b/0x90
[  245.294873]        filldir+0xb2/0x140
[  245.294876]        dcache_readdir+0xf4/0x170
[  245.294877]        iterate_dir+0x7a/0x170
[  245.294879]        SyS_getdents+0xa2/0x150
[  245.294881]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  245.294882] 
               -> #3 (&sb->s_type->i_mutex_key#3){+++++.}:
[  245.294888]        lock_acquire+0xc9/0x220
[  245.294890]        down_write+0x3f/0x70
[  245.294894]        start_creating+0x5a/0x100
[  245.294897]        debugfs_create_dir+0xc/0xd0
[  245.294900]        blk_mq_debugfs_register+0x1c/0x70
[  245.294904]        blk_mq_register_dev+0xbd/0x120
[  245.294908]        blk_register_queue+0x4c/0x150
[  245.294910]        device_add_disk+0x23b/0x4f0
[  245.294914]        nvme_validate_ns+0x1fa/0x260
[  245.294917]        nvme_scan_work+0x188/0x300
[  245.294921]        process_one_work+0x1f4/0x6d0
[  245.294924]        worker_thread+0x49/0x4a0
[  245.294927]        kthread+0x107/0x140
[  245.294929]        ret_from_fork+0x2e/0x40
[  245.294929] 
               -> #2 (all_q_mutex){+.+.+.}:
[  245.294935]        lock_acquire+0xc9/0x220
[  245.294938]        __mutex_lock+0x6e/0x990
[  245.294941]        mutex_lock_nested+0x16/0x20
[  245.294944]        blk_mq_queue_reinit_work+0x13/0x110
[  245.294947]        blk_mq_queue_reinit_dead+0x17/0x20
[  245.294950]        cpuhp_invoke_callback+0x9c/0x8a0
[  245.294953]        cpuhp_down_callbacks+0x3d/0x80
[  245.294956]        _cpu_down+0xad/0xe0
[  245.294959]        freeze_secondary_cpus+0xb1/0x3d0
[  245.294962]        suspend_devices_and_enter+0x4a0/0xca0
[  245.294964]        pm_suspend+0x59e/0xa50
[  245.294966]        state_store+0x7d/0xf0
[  245.294969]        kobj_attr_store+0xf/0x20
[  245.294972]        sysfs_kf_write+0x40/0x50
[  245.294974]        kernfs_fop_write+0x130/0x1b0
[  245.294977]        __vfs_write+0x23/0x120
[  245.294979]        vfs_write+0xc6/0x1f0
[  245.294982]        SyS_write+0x44/0xb0
[  245.294984]        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  245.294985] 
               -> #1 (cpu_hotplug.lock){+.+.+.}:
[  245.294990]        lock_acquire+0xc9/0x220
[  245.294993]        __mutex_lock+0x6e/0x990
[  245.294996]        mutex_lock_nested+0x16/0x20
[  245.294999]        get_online_cpus+0x61/0x80
[  245.295002]        _rcu_barrier+0x9f/0x160
[  245.295004]        rcu_barrier+0x10/0x20
[  245.295008]        netdev_run_todo+0x5f/0x310
[  245.295011]        rtnl_unlock+0x9/0x10
[  245.295013]        default_device_exit_batch+0x133/0x150
[  245.295017]        ops_exit_list.isra.0+0x4d/0x60
[  245.295020]        cleanup_net+0x1d8/0x2c0
[  245.295023]        process_one_work+0x1f4/0x6d0
[  245.295026]        worker_thread+0x49/0x4a0
[  245.295028]        kthread+0x107/0x140
[  245.295030]        ret_from_fork+0x2e/0x40
[  245.295030] 
               -> #0 (rcu_preempt_state.barrier_mutex){+.+...}:
[  245.295036]        __lock_acquire+0x1b44/0x1b50
[  245.295039]        lock_acquire+0xc9/0x220
[  245.295042]        __mutex_lock+0x6e/0x990
[  245.295045]        mutex_lock_nested+0x16/0x20
[  245.295047]        _rcu_barrier+0x31/0x160
[  245.295049]        rcu_barrier+0x10/0x20
[  245.295089]        i915_gem_shrink_all+0x33/0x40 [i915]
[  245.295126]        i915_gem_freeze+0x20/0x30 [i915]
[  245.295152]        i915_pm_freeze+0x1d/0x20 [i915]
[  245.295155]        pci_pm_freeze+0x59/0xe0
[  245.295157]        dpm_run_callback+0x6f/0x330
[  245.295159]        __device_suspend+0xea/0x330
[  245.295161]        async_suspend+0x1a/0x90
[  245.295165]        async_run_entry_fn+0x34/0x160
[  245.295168]        process_one_work+0x1f4/0x6d0
[  245.295171]        worker_thread+0x49/0x4a0
[  245.295173]        kthread+0x107/0x140
[  245.295175]        ret_from_fork+0x2e/0x40
[  245.295176] 
               other info that might help us debug this:

[  245.295177] Chain exists of:
                 rcu_preempt_state.barrier_mutex --> &mm->mmap_sem -->
&dev->struct_mutex

[  245.295181]  Possible unsafe locking scenario:

[  245.295182]        CPU0                    CPU1
[  245.295183]        ----                    ----
[  245.295183]   lock(&dev->struct_mutex);
[  245.295185]                                lock(&mm->mmap_sem);
[  245.295187]                                lock(&dev->struct_mutex);
[  245.295189]   lock(rcu_preempt_state.barrier_mutex);
[  245.295191] 
                *** DEADLOCK ***

[  245.295192] 4 locks held by kworker/u8:1/72:
[  245.295193]  #0:  ("events_unbound"){.+.+.+}, at: [<ffffffff8109dcae>]
process_one_work+0x16e/0x6d0
[  245.295199]  #1:  ((&entry->work)){+.+.+.}, at: [<ffffffff8109dcae>]
process_one_work+0x16e/0x6d0
[  245.295204]  #2:  (&dev->mutex){......}, at: [<ffffffff815d5e34>]
__device_suspend+0xb4/0x330
[  245.295208]  #3:  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa00bbc48>]
i915_gem_freeze+0x18/0x30 [i915]
[  245.295247] 
               stack backtrace:
[  245.295250] CPU: 2 PID: 72 Comm: kworker/u8:1 Not tainted
4.11.0-rc1-CI-CI_DRM_2296+ #1
[  245.295251] Hardware name: GIGABYTE GB-BKi7A-7500/MFLP7AP-00, BIOS F1
07/27/2016
[  245.295256] Workqueue: events_unbound async_run_entry_fn
[  245.295258] Call Trace:
[  245.295262]  dump_stack+0x67/0x92
[  245.295266]  print_circular_bug+0x1e0/0x2e0
[  245.295269]  __lock_acquire+0x1b44/0x1b50
[  245.295273]  ? trace_hardirqs_on+0xd/0x10
[  245.295277]  lock_acquire+0xc9/0x220
[  245.295280]  ? _rcu_barrier+0x31/0x160
[  245.295282]  ? _rcu_barrier+0x31/0x160
[  245.295286]  __mutex_lock+0x6e/0x990
[  245.295288]  ? _rcu_barrier+0x31/0x160
[  245.295290]  ? _rcu_barrier+0x31/0x160
[  245.295293]  ? synchronize_rcu_expedited+0x35/0xb0
[  245.295296]  ? _raw_spin_unlock_irqrestore+0x52/0x60
[  245.295300]  mutex_lock_nested+0x16/0x20
[  245.295302]  _rcu_barrier+0x31/0x160
[  245.295305]  rcu_barrier+0x10/0x20
[  245.295343]  i915_gem_shrink_all+0x33/0x40 [i915]
[  245.295378]  i915_gem_freeze+0x20/0x30 [i915]
[  245.295404]  i915_pm_freeze+0x1d/0x20 [i915]
[  245.295406]  pci_pm_freeze+0x59/0xe0
[  245.295409]  dpm_run_callback+0x6f/0x330
[  245.295411]  ? pci_pm_poweroff+0xf0/0xf0
[  245.295413]  __device_suspend+0xea/0x330
[  245.295415]  async_suspend+0x1a/0x90
[  245.295419]  async_run_entry_fn+0x34/0x160
[  245.295422]  process_one_work+0x1f4/0x6d0
[  245.295425]  ? process_one_work+0x16e/0x6d0
[  245.295429]  worker_thread+0x49/0x4a0
[  245.295432]  kthread+0x107/0x140
[  245.295435]  ? process_one_work+0x6d0/0x6d0
[  245.295437]  ? kthread_create_on_node+0x40/0x40
[  245.295440]  ret_from_fork+0x2e/0x40
[  250.434263] usb usb1: root hub lost power or was reset
[  250.434265] usb usb2: root hub lost power or was reset
[  250.434719] usb usb3: root hub lost power or was reset
[  250.434720] usb usb4: root hub lost power or was reset
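
For context, the report above reduces to a classic AB-BA ordering inversion: the
existing chain #0..#5 already orders rcu_preempt_state.barrier_mutex ahead of
&dev->struct_mutex (barrier_mutex --> cpu_hotplug.lock --> all_q_mutex --> the
debugfs i_mutex taken during nvme/blk-mq registration --> mmap_sem -->
struct_mutex), while the new i915_gem_freeze() path takes &dev->struct_mutex
first and then wants barrier_mutex via the rcu_barrier() call in
i915_gem_shrink_all(). Purely as an illustration (a minimal userspace pthreads
sketch, not kernel code; lock_a and lock_b are hypothetical stand-ins for the
two ends of that chain, with the intermediate locks collapsed away), the
"Possible unsafe locking scenario" table above corresponds to:

/* deadlock_sketch.c - userspace analogue of the AB-BA inversion above.
 * lock_a stands in for &dev->struct_mutex, lock_b for
 * rcu_preempt_state.barrier_mutex.  Build with: cc -pthread deadlock_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "struct_mutex"  */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "barrier_mutex" */

/* CPU0 in the scenario table: freeze path - struct_mutex, then barrier_mutex. */
static void *cpu0(void *arg)
{
        pthread_mutex_lock(&lock_a);
        sleep(1);                      /* widen the race window */
        pthread_mutex_lock(&lock_b);   /* blocks once cpu1 holds lock_b */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

/* CPU1: the pre-existing chain, which ends up taking struct_mutex last. */
static void *cpu1(void *arg)
{
        pthread_mutex_lock(&lock_b);
        sleep(1);
        pthread_mutex_lock(&lock_a);   /* blocks once cpu0 holds lock_a */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);        /* never returns: both threads block */
        pthread_join(t1, NULL);
        printf("no deadlock this run\n");
        return 0;
}

In the sketch both threads end up blocked on each other's lock. In the kernel,
lockdep reports the ordering violation before it is actually hit, which is
presumably why the run ends in dmesg-warn rather than a hang.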
