[Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue?
Rodrigo Vivi
rodrigo.vivi at intel.com
Thu Jun 30 13:59:00 UTC 2022
On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote:
>
> On 30/06/2022 12:19, Tetsuo Handa wrote:
> > On 2022/06/30 19:17, Tvrtko Ursulin wrote:
> > > Could you give more context on the reasoning here please? What is the difference
> > > between using the system_wq and flushing it from a random context? Or is the
> > > concern about flushing from specific contexts?
> >
> > Excuse me, but I couldn't catch what you meant. Let me show three patterns of
> > problems, each with an example.
> >
> > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says:
> >
> > Tejun Heo commented that it makes no sense at all to call flush_workqueue()
> > on the shared WQs as the caller has no idea what it's gonna end up waiting for.
> >
> > The "Think twice before calling this function! It's very easy to get into trouble
> > if you don't take great care." warning message does not help avoiding problems.
> >
> > Let's change direction: developers whose code cannot avoid flushing should use
> > their own local WQs instead of flush_scheduled_work()/flush_workqueue(system_*_wq).
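> >
> > For illustration, a minimal sketch of such a conversion (the "mydrv" names are
> > made up for this example, not from any real driver):
> >
> >   #include <linux/module.h>
> >   #include <linux/workqueue.h>
> >
> >   static struct workqueue_struct *mydrv_wq;
> >
> >   static void mydrv_work_fn(struct work_struct *work)
> >   {
> >           /* ... the driver's actual deferred work ... */
> >   }
> >   static DECLARE_WORK(mydrv_work, mydrv_work_fn);
> >
> >   static int __init mydrv_init(void)
> >   {
> >           mydrv_wq = alloc_workqueue("mydrv", 0, 0);
> >           if (!mydrv_wq)
> >                   return -ENOMEM;
> >           return 0;
> >   }
> >
> >   static void mydrv_kick(void)    /* called wherever work is scheduled */
> >   {
> >           /* was: schedule_work(&mydrv_work), i.e. the shared system_wq */
> >           queue_work(mydrv_wq, &mydrv_work);
> >   }
> >
> >   static void __exit mydrv_exit(void)
> >   {
> >           /* was: flush_scheduled_work(), which waits for everyone's work */
> >           flush_workqueue(mydrv_wq);    /* waits for this driver's work only */
> >           destroy_workqueue(mydrv_wq);  /* drains and frees the WQ */
> >   }
> >
> >   module_init(mydrv_init);
> >   module_exit(mydrv_exit);
> >   MODULE_LICENSE("GPL");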
> >
> > Three patterns of problems are:
> >
> > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause a deadlock
> > (circular locking dependency) problem.
> >
> > (b) Flushing with specific locks (e.g. module_mutex) held can cause a deadlock
> > (circular locking dependency) problem.
> >
> > (c) Even if there is no possibility of deadlock, flushing with specific locks
> > held can cause khungtaskd to complain.
> >
> > An example of (a):
> >
> > The ext4 filesystem called flush_scheduled_work(), which was meant to wait only
> > for work items scheduled by the ext4 filesystem, but it also ended up waiting for
> > a work item scheduled by the 9p filesystem.
> > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8
> >
> > Fixed by reverting the problematic patch.
> >
> > An example of (b):
> >
> > It is GFP_KERNEL context when a module's __exit function is called. But whether
> > flush_workqueue() ends up being called from a restricted context depends on which
> > locks the module's __exit function holds.
> >
> > If request_module() is called from some work function running on one of the
> > system-wide WQs, and flush_workqueue() is called on that WQ from a module's __exit
> > function, the kernel might deadlock on the module_mutex lock. Making sure that
> > flush_workqueue() is not called on system-wide WQs is the safer choice.
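> >
> > As a hypothetical sketch of pattern (b) (again, the names are made up), the
> > circular dependency can look like this:
> >
> >   #include <linux/kmod.h>
> >   #include <linux/module.h>
> >   #include <linux/workqueue.h>
> >
> >   static void mydrv_work_fn(struct work_struct *work)
> >   {
> >           /* Loading a module can end up waiting on module locking. */
> >           request_module("mydrv-helper");
> >   }
> >   static DECLARE_WORK(mydrv_work, mydrv_work_fn);
> >
> >   static void __exit mydrv_exit(void)
> >   {
> >           /*
> >            * This waits for *every* pending work item on system_wq, not
> >            * just mydrv_work; if any of them is busy loading a module,
> >            * module unload and module load can end up waiting on each
> >            * other through module_mutex.
> >            */
> >           flush_scheduled_work();
> >   }
> >   module_exit(mydrv_exit);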
> >
> > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage")
> > is for drivers/input/mouse/psmouse-smbus.c .
> >
> > An example of (c):
> >
> > The ath9k driver calls schedule_work() via request_firmware_nowait().
> > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833
> >
> > The ath6kl driver calls flush_scheduled_work(), which needlessly waits for the
> > completion of work items scheduled by the ath9k driver (i.e. the loading of
> > firmware used by the ath9k driver).
> > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee
> >
> > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git
> > might be able to mitigate these problems. (Waiting for the next merge window...)
>
> Okay, from 1b3ce51dde365296:
>
> "Flushing system-wide workqueues is dangerous and will be forbidden."
>
> Thank you, this exactly explains the motivation, which is what I was after. I
> certainly agree there is a possibility of lock coupling via the shared wq,
> so that is fine by me.
>
> > > On the i915 specifics, I am pretty sure the caller in
> > > drivers/gpu/drm/i915/gt/selftest_execlists.c can be removed. It synchronizes with
> > > the error capture side of things, which is not required for the test to work.
> > >
> > > I can send a patch for that, or you can, as you prefer?
> >
> > OK. Please send a patch for that, since that patch will go to the linux-next.git
> > tree via the gpu/drm/i915 driver tree.
>
> Patch sent. If I am right, the easiest solution was just to remove the flush.
> If I was wrong, though, I'll need to create a dedicated wq, so we will see what
> our automated CI says.
But besides flush_scheduled_work(), it looks like
we also need to take care of the flush_workqueue() calls, no?!

* i915_gem_drain_workqueue()
* intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq);
* i915_gem_pm.c: flush_workqueue(i915->wq);

and the display ones:

* dev_priv->modeset_wq
* i915->flip_wq

besides the flush_scheduled_work() in intel_modeset_driver_remove_noirq().
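
For that last one, I guess the conversion would look something like this (just a
rough sketch, and "i915->unordered_wq" is a name I'm making up here, not an
existing field):

    /* at driver init */
    i915->unordered_wq = alloc_workqueue("i915-unordered", 0, 0);
    if (!i915->unordered_wq)
            return -ENOMEM;

    /* callers then use queue_work(i915->unordered_wq, &work)
     * instead of schedule_work(&work) */

    /* in intel_modeset_driver_remove_noirq() */
    flush_workqueue(i915->unordered_wq);    /* was: flush_scheduled_work() */

    /* at driver teardown */
    destroy_workqueue(i915->unordered_wq);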
>
> Regards,
>
> Tvrtko