[Intel-gfx] [PATCH 24/25] drm/i915: Perform device reset under stop-machine
Chris Wilson
chris at chris-wilson.co.uk
Fri Nov 2 17:46:08 UTC 2018
Quoting Chris Wilson (2018-11-02 16:12:31)
> If we do a device level reset, we lose vital registers that may be in
> concurrent use by userspace (i.e. the GGTT and its fencing). To be
> paranoid and prevent that memory access from being corrupted, we want to
> pause all other processes/threads, so that the device reset is the only
> thing running on the system.
If we can live with a glitch in GTT read/writes across a device level
reset, we can ignore the requirement to take stop_machine, and relax the
atomicity requirements.
stop_machine() is quite nasty as it impacts several core systems that
allocate structs inside the cpu_hotplug mutex, e.g.:
<4> [1802.499647] ======================================================
<4> [1802.499649] WARNING: possible circular locking dependency detected
<4> [1802.499652] 4.19.0-CI-Trybot_3176+ #1 Tainted: G U
<4> [1802.499653] ------------------------------------------------------
<4> [1802.499655] drv_selftest/12037 is trying to acquire lock:
<4> [1802.499657] 00000000f9309a5f (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x12/0x30
<4> [1802.499664] \x0abut task is already holding lock:
<4> [1802.499666] 0000000036dadfbb (&dev->struct_mutex){+.+.}, at: igt_reset_queue+0x3f/0x520 [i915]
<4> [1802.499698] \x0awhich lock already depends on the new lock.\x0a
<4> [1802.499701] \x0athe existing dependency chain (in reverse order) is:
<4> [1802.499703] \x0a-> #4 (&dev->struct_mutex){+.+.}:
<4> [1802.499707] 0xffffffffa01c9337
<4> [1802.499709] 0xffffffffa01cf89c
<4> [1802.499713] __do_fault+0x1b/0x80
<4> [1802.499715] __handle_mm_fault+0x8e0/0xf10
<4> [1802.499717] handle_mm_fault+0x196/0x3a0
<4> [1802.499721] __do_page_fault+0x295/0x590
<4> [1802.499724] page_fault+0x1e/0x30
<4> [1802.499726] \x0a-> #3 (&mm->mmap_sem){++++}:
<4> [1802.499731] _copy_to_user+0x1e/0x70
<4> [1802.499734] perf_read+0x232/0x2a0
<4> [1802.499737] __vfs_read+0x31/0x170
<4> [1802.499739] vfs_read+0x9e/0x140
<4> [1802.499742] ksys_read+0x50/0xc0
<4> [1802.499744] do_syscall_64+0x55/0x190
<4> [1802.499758] entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1802.499760] \x0a-> #2 (&cpuctx_mutex){+.+.}:
<4> [1802.499764] perf_event_init_cpu+0x5a/0x90
<4> [1802.499767] perf_event_init+0x19d/0x1cd
<4> [1802.499770] start_kernel+0x32b/0x4c0
<4> [1802.499772] secondary_startup_64+0xa4/0xb0
<4> [1802.499774] \x0a-> #1 (pmus_lock){+.+.}:
<4> [1802.499777] perf_event_init_cpu+0x21/0x90
<4> [1802.499780] cpuhp_invoke_callback+0x97/0x9b0
<4> [1802.499782] _cpu_up+0xa2/0x130
<4> [1802.499784] do_cpu_up+0x91/0xc0
<4> [1802.499787] smp_init+0x5d/0xa9
<4> [1802.499790] kernel_init_freeable+0x16f/0x352
<4> [1802.499795] kernel_init+0x5/0x100
<4> [1802.499798] ret_from_fork+0x3a/0x50
<4> [1802.499812] \x0a-> #0 (cpu_hotplug_lock.rw_sem){++++}:
<4> [1802.499819] cpus_read_lock+0x34/0xa0
<4> [1802.499822] stop_machine+0x12/0x30
<4> [1802.499856] i915_reset+0x1c0/0x360 [i915]
<4> [1802.499898] igt_reset_queue+0x192/0x520 [i915]
<4> [1802.499964] __i915_subtests+0x5e/0xf0 [i915]
<4> [1802.500007] intel_hangcheck_live_selftests+0x60/0xa0 [i915]
<4> [1802.500037] __run_selftests+0x10b/0x190 [i915]
<4> [1802.500067] i915_live_selftests+0x2c/0x60 [i915]
<4> [1802.500093] i915_pci_probe+0x50/0xa0 [i915]
<4> [1802.500097] pci_device_probe+0xa1/0x130
<4> [1802.500100] really_probe+0x25d/0x3c0
<4> [1802.500102] driver_probe_device+0x10a/0x120
<4> [1802.500105] __driver_attach+0xdb/0x100
<4> [1802.500107] bus_for_each_dev+0x74/0xc0
<4> [1802.500109] bus_add_driver+0x15f/0x250
<4> [1802.500111] driver_register+0x56/0xe0
<4> [1802.500114] do_one_initcall+0x58/0x2e0
<4> [1802.500116] do_init_module+0x56/0x1ea
<4> [1802.500119] load_module+0x26f5/0x29d0
<4> [1802.500122] __se_sys_finit_module+0xd3/0xf0
<4> [1802.500124] do_syscall_64+0x55/0x190
<4> [1802.500126] entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [1802.500128] \x0aother info that might help us debug this:\x0a
<4> [1802.500131] Chain exists of:\x0a cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex\x0a
<4> [1802.500137] Possible unsafe locking scenario:\x0a
<4> [1802.500139] CPU0 CPU1
<4> [1802.500141] ---- ----
<4> [1802.500142] lock(&dev->struct_mutex);
<4> [1802.500145] lock(&mm->mmap_sem);
<4> [1802.500147] lock(&dev->struct_mutex);
<4> [1802.500149] lock(cpu_hotplug_lock.rw_sem);
The recursive lock here is unlikely, but could be hit if we pass a GTT
address into a perf read() call. Far fetched, but not impossible.
Replacing the struct_mutex to control the PTE insertion inside
i915_gem_fault() all end up in the same problem where we may wait on the
mutex/semaphore (e.g. via direct reclaim) and need the same mutex for the
reset.
-Chris
More information about the Intel-gfx
mailing list