[PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H
Daniel Vetter
daniel at ffwll.ch
Mon Aug 16 15:39:13 UTC 2021
On Fri, Aug 13, 2021 at 07:02:55PM +0000, Matthew Brost wrote:
> On Fri, Aug 13, 2021 at 05:11:59PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote:
> > > > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote:
> > > > > > > Flush the work queue for GuC generated G2H messages during a GT reset.
> > > > > > > This is accomplished by spinning until the list of outstanding G2H
> > > > > > > goes empty.
> > > > > > >
> > > > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface")
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > > > > Cc: <stable at vger.kernel.org>
> > > > > > > ---
> > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++
> > > > > > > 1 file changed, 5 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
> > > > > > >  			wait_for_reset(guc, &guc->outstanding_submission_g2h);
> > > > > > >  		} while (!list_empty(&guc->ct.requests.incoming));
> > > > > > >  	}
> > > > > > > +
> > > > > > > +	/* Flush any GuC generated G2H */
> > > > > > > +	while (!list_empty(&guc->ct.requests.incoming))
> > > > > > > +		msleep(20);
> > > > > >
> > > > > > flush_work or flush_workqueue, because those come with lockdep
> > > > > > annotations. Don't hand-roll stuff like this if at all possible.
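
For reference, roughly what I mean - a sketch only, untested, but the
worker and the list are the ones this patch already touches:

	/*
	 * Sketch: flush the CT worker instead of open-coding a poll.
	 * flush_work() waits for the queued work item to finish and,
	 * crucially, carries lockdep annotations, so an inversion like
	 * worker -> fs_reclaim -> reset.mutex gets reported at the
	 * flush site instead of deadlocking in the field.
	 */
	flush_work(&guc->ct.requests.worker);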
> > > > >
> > > > > lockdep puked when I used that.
> > > >
> > > > Lockdep tends to be right ... So we definitely want that, but maybe a
> > > > different flavour, or there's something wrong with the workqueue setup.
> > > >
> > >
> > > Here is a dependency chain that lockdep doesn't like.
> > >
> > > fs_reclaim_acquire -> &gt->reset.mutex (shrinker)
> > > workqueue -> fs_reclaim_acquire (error capture in workqueue)
> > > &gt->reset.mutex -> workqueue (reset)
> > >
> > > In practice I don't think we could ever hit this, but lockdep does
> > > look right here. Trying to work out how to fix this. We really need
> > > all G2H processing to be done before we proceed during a reset or we
> > > have races. I have a few ideas of how to handle this but can't
> > > convince myself any of them are fully safe.
> >
> > It might be false sharing due to a single workqueue, or a single-threaded
> > workqueue.
> >
> > Essentially the lockdep annotations for work_struct track two things:
> > - dependencies against the specific work item
> > - dependencies against anything queued on that work queue, if you flush
> > the entire queue, or if you flush a work item that's on a
> > single-threaded queue.
> >
> > Because if guc/host communication really is inverted like this, you have a
> > much bigger problem.
> >
> > Note that if you pick a different workqueue for your guc work stuff then
> > you need to make sure that's all properly flushed on suspend and driver
> > unload.
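
Concretely, something like this - just a sketch of an init path, and
"guc_wq" is a made-up field name here, not something that exists:

	/*
	 * Sketch: give G2H processing its own workqueue so that
	 * flushing it only creates lockdep dependencies against guc
	 * work items, not against everything else sharing system_wq.
	 */
	guc->guc_wq = alloc_workqueue("i915-guc-g2h", 0, 0);
	if (!guc->guc_wq)
		return -ENOMEM;

	/* queue the CT worker here instead of on system_wq */
	queue_work(guc->guc_wq, &guc->ct.requests.worker);

	/* and on suspend/unload: flush_workqueue() + destroy_workqueue() */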
> >
> > It might also be that the reset work is on the wrong workqueue.
> >
> > Either way, this must be fixed, because I've seen too many of these "it
> > never happens in practice" cases blow up. Plus, if your locking scheme is
> > engineered on quicksand, forget about anyone ever understanding it.
>
> The solution is to allocate the memory for the error capture in an atomic
> context when the error capture is being done from the G2H work queue. That
> means the capture can fail if the system is under memory pressure, but that
> is better than a lockdep splat.
Ah yeah if this is for error capture then GFP_ATOMIC is the right option.
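
Roughly, as a sketch - the gfp plumbing into i915_gpu_coredump is
handwaved here, and "in_g2h_worker" is a stand-in for however the
atomic path gets detected:

	/*
	 * Sketch: error capture from the G2H worker must not enter
	 * fs_reclaim, so allocate atomically there. GFP_ATOMIC can
	 * fail under memory pressure, in which case we lose the
	 * capture - preferable to a reclaim-vs-reset deadlock.
	 */
	gfp_t gfp = in_g2h_worker ? GFP_ATOMIC : GFP_KERNEL;
	struct i915_gpu_coredump *error;

	error = kzalloc(sizeof(*error), gfp);
	if (!error)
		return NULL; /* no coredump, but no deadlock either */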
-Daniel
>
> Matt
>
> > -Daniel
> >
> > >
> > > Splat below:
> > >
> > > [ 154.625989] ======================================================
> > > [ 154.632195] WARNING: possible circular locking dependency detected
> > > [ 154.638393] 5.14.0-rc5-guc+ #50 Tainted: G U
> > > [ 154.643991] ------------------------------------------------------
> > > [ 154.650196] i915_selftest/1673 is trying to acquire lock:
> > > [ 154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0
> > > [ 154.665826]
> > > but task is already holding lock:
> > > [ 154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [ 154.680659]
> > > which lock already depends on the new lock.
> > >
> > > [ 154.688857]
> > > the existing dependency chain (in reverse order) is:
> > > [ 154.696365]
> > > -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
> > > [ 154.702571] lock_acquire+0xd2/0x300
> > > [ 154.706695] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
> > > [ 154.712959] intel_gt_init_reset+0x61/0x80 [i915]
> > > [ 154.718258] intel_gt_init_early+0xe6/0x120 [i915]
> > > [ 154.723648] i915_driver_probe+0x592/0xdc0 [i915]
> > > [ 154.728942] i915_pci_probe+0x43/0x1c0 [i915]
> > > [ 154.733891] pci_device_probe+0x9b/0x110
> > > [ 154.738362] really_probe+0x1a6/0x3a0
> > > [ 154.742568] __driver_probe_device+0xf9/0x170
> > > [ 154.747468] driver_probe_device+0x19/0x90
> > > [ 154.752114] __driver_attach+0x99/0x170
> > > [ 154.756492] bus_for_each_dev+0x73/0xc0
> > > [ 154.760870] bus_add_driver+0x14b/0x1f0
> > > [ 154.765248] driver_register+0x67/0xb0
> > > [ 154.769542] i915_init+0x18/0x8c [i915]
> > > [ 154.773964] do_one_initcall+0x53/0x2e0
> > > [ 154.778343] do_init_module+0x56/0x210
> > > [ 154.782639] load_module+0x25fc/0x29f0
> > > [ 154.786934] __do_sys_finit_module+0xae/0x110
> > > [ 154.791835] do_syscall_64+0x38/0xc0
> > > [ 154.795958] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 154.801558]
> > > -> #1 (fs_reclaim){+.+.}-{0:0}:
> > > [ 154.807241] lock_acquire+0xd2/0x300
> > > [ 154.811361] fs_reclaim_acquire+0x9e/0xd0
> > > [ 154.815914] kmem_cache_alloc_trace+0x30/0x790
> > > [ 154.820899] i915_gpu_coredump_alloc+0x53/0x1a0 [i915]
> > > [ 154.826649] i915_gpu_coredump+0x39/0x560 [i915]
> > > [ 154.831866] i915_capture_error_state+0xa/0x70 [i915]
> > > [ 154.837513] intel_guc_context_reset_process_msg+0x174/0x1f0 [i915]
> > > [ 154.844383] ct_incoming_request_worker_func+0x130/0x1b0 [i915]
> > > [ 154.850898] process_one_work+0x264/0x590
> > > [ 154.855451] worker_thread+0x4b/0x3a0
> > > [ 154.859655] kthread+0x147/0x170
> > > [ 154.863428] ret_from_fork+0x1f/0x30
> > > [ 154.867548]
> > > -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}:
> > > [ 154.875747] check_prev_add+0x90/0xc30
> > > [ 154.880042] __lock_acquire+0x1643/0x2110
> > > [ 154.884595] lock_acquire+0xd2/0x300
> > > [ 154.888715] __flush_work+0x373/0x4d0
> > > [ 154.892920] intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [ 154.899606] intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [ 154.905166] reset_prepare+0x55/0x60 [i915]
> > > [ 154.909946] intel_gt_reset+0x11c/0x300 [i915]
> > > [ 154.914984] do_device_reset+0x13/0x20 [i915]
> > > [ 154.919936] check_whitelist_across_reset+0x166/0x250 [i915]
> > > [ 154.926212] live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [ 154.932037] __i915_subtests.cold+0x20/0x74 [i915]
> > > [ 154.937428] __run_selftests.cold+0x96/0xee [i915]
> > > [ 154.942816] i915_live_selftests+0x2c/0x60 [i915]
> > > [ 154.948125] i915_pci_probe+0x93/0x1c0 [i915]
> > > [ 154.953076] pci_device_probe+0x9b/0x110
> > > [ 154.957545] really_probe+0x1a6/0x3a0
> > > [ 154.961749] __driver_probe_device+0xf9/0x170
> > > [ 154.966653] driver_probe_device+0x19/0x90
> > > [ 154.971290] __driver_attach+0x99/0x170
> > > [ 154.975671] bus_for_each_dev+0x73/0xc0
> > > [ 154.980053] bus_add_driver+0x14b/0x1f0
> > > [ 154.984431] driver_register+0x67/0xb0
> > > [ 154.988725] i915_init+0x18/0x8c [i915]
> > > [ 154.993149] do_one_initcall+0x53/0x2e0
> > > [ 154.997527] do_init_module+0x56/0x210
> > > [ 155.001822] load_module+0x25fc/0x29f0
> > > [ 155.006118] __do_sys_finit_module+0xae/0x110
> > > [ 155.011019] do_syscall_64+0x38/0xc0
> > > [ 155.015139] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 155.020729]
> > > other info that might help us debug this:
> > >
> > > [ 155.028752] Chain exists of:
> > >   (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex
> > >
> > > [ 155.041294] Possible unsafe locking scenario:
> > >
> > > [ 155.047240]        CPU0                    CPU1
> > > [ 155.051791]        ----                    ----
> > > [ 155.056344]   lock(&gt->reset.mutex);
> > > [ 155.060026]                               lock(fs_reclaim);
> > > [ 155.065706]                               lock(&gt->reset.mutex);
> > > [ 155.071912]   lock((work_completion)(&ct->requests.worker));
> > > [ 155.077595]
> > >  *** DEADLOCK ***
> > >
> > > [ 155.083534] 2 locks held by i915_selftest/1673:
> > > [ 155.088086] #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170
> > > [ 155.096460] #1: ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915]
> > > [ 155.105845]
> > > stack backtrace:
> > > [ 155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G U 5.14.0-rc5-guc+ #50
> > > [ 155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020
> > > [ 155.132530] Call Trace:
> > > [ 155.134999] dump_stack_lvl+0x57/0x7d
> > > [ 155.138690] check_noncircular+0x12d/0x150
> > > [ 155.142814] check_prev_add+0x90/0xc30
> > > [ 155.146587] __lock_acquire+0x1643/0x2110
> > > [ 155.150618] lock_acquire+0xd2/0x300
> > > [ 155.154218] ? __flush_work+0x350/0x4d0
> > > [ 155.158073] ? __lock_acquire+0x5f3/0x2110
> > > [ 155.162194] __flush_work+0x373/0x4d0
> > > [ 155.165883] ? __flush_work+0x350/0x4d0
> > > [ 155.169739] ? mark_held_locks+0x49/0x70
> > > [ 155.173686] ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [ 155.178679] intel_guc_submission_reset_prepare+0xf3/0x340 [i915]
> > > [ 155.184857] ? _raw_spin_unlock_irqrestore+0x50/0x70
> > > [ 155.189843] intel_uc_reset_prepare+0x40/0x50 [i915]
> > > [ 155.194891] reset_prepare+0x55/0x60 [i915]
> > > [ 155.199145] intel_gt_reset+0x11c/0x300 [i915]
> > > [ 155.203657] ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [ 155.208641] ? do_engine_reset+0x10/0x10 [i915]
> > > [ 155.213243] do_device_reset+0x13/0x20 [i915]
> > > [ 155.217664] check_whitelist_across_reset+0x166/0x250 [i915]
> > > [ 155.223415] live_reset_whitelist.cold+0x6a/0x7a [i915]
> > > [ 155.228725] __i915_subtests.cold+0x20/0x74 [i915]
> > > [ 155.233593] ? __i915_live_teardown+0x50/0x50 [i915]
> > > [ 155.238644] ? __intel_gt_live_setup+0x30/0x30 [i915]
> > > [ 155.243773] __run_selftests.cold+0x96/0xee [i915]
> > > [ 155.248646] i915_live_selftests+0x2c/0x60 [i915]
> > > [ 155.253425] i915_pci_probe+0x93/0x1c0 [i915]
> > > [ 155.257854] ? _raw_spin_unlock_irqrestore+0x3d/0x70
> > > [ 155.262839] pci_device_probe+0x9b/0x110
> > > [ 155.266785] really_probe+0x1a6/0x3a0
> > > [ 155.270467] __driver_probe_device+0xf9/0x170
> > > [ 155.274846] driver_probe_device+0x19/0x90
> > > [ 155.278968] __driver_attach+0x99/0x170
> > > [ 155.282824] ? __device_attach_driver+0xd0/0xd0
> > > [ 155.287376] ? __device_attach_driver+0xd0/0xd0
> > > [ 155.291928] bus_for_each_dev+0x73/0xc0
> > > [ 155.295792] bus_add_driver+0x14b/0x1f0
> > > [ 155.299648] driver_register+0x67/0xb0
> > > [ 155.303419] i915_init+0x18/0x8c [i915]
> > > [ 155.307328] ? 0xffffffffa0188000
> > > [ 155.310669] do_one_initcall+0x53/0x2e0
> > > [ 155.314525] ? rcu_read_lock_sched_held+0x4d/0x80
> > > [ 155.319253] ? kmem_cache_alloc_trace+0x5ad/0x790
> > > [ 155.323982] do_init_module+0x56/0x210
> > > [ 155.327754] load_module+0x25fc/0x29f0
> > > [ 155.331528] ? __do_sys_finit_module+0xae/0x110
> > > [ 155.336081] __do_sys_finit_module+0xae/0x110
> > > [ 155.340462] do_syscall_64+0x38/0xc0
> > > [ 155.344060] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 155.349137] RIP: 0033:0x7efc4496389d
> > > [ 155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> > > [ 155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [ 155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d
> > > [ 155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004
> > > [ 155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0
> > > [ 155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610
> > > [ 155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240
> > >
> > > > This is the major reason why inventing any wheels locally in the kernel
> > > > isn't a good idea - aside from the duplicated code because likely there is
> > > > a solution for whatever you need. There's all the validation tools,
> > > > careful thought about semantics (especially around races) and all that
> > > > stuff that you're always missing on your hand-rolled thing. Aside from
> > > > anything hand-rolled is much harder to read, since intent isn't clear.
> > > > -Daniel
> > > >
> > > >
> > > > >
> > > > > Matt
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > > > +
> > > > > > >  	scrub_guc_desc_for_outstanding_g2h(guc);
> > > > > > >  }
> > > > > > >
> > > > > > > --
> > > > > > > 2.32.0
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch