[Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue?
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu Jun 30 13:09:02 UTC 2022
On 30/06/2022 12:19, Tetsuo Handa wrote:
> On 2022/06/30 19:17, Tvrtko Ursulin wrote:
>> Could you give more context on reasoning here please? What is the difference
>> between using the system_wq and flushing it from a random context? Or concern
>> is about flushing from specific contexts?
>
> Excuse me, but I couldn't catch what you want. Thus, I show three patterns of
> problems with an example.
>
> Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says:
>
> Tejun Heo commented that it makes no sense at all to call flush_workqueue()
> on the shared WQs as the caller has no idea what it's gonna end up waiting for.
>
> The "Think twice before calling this function! It's very easy to get into trouble
> if you don't take great care." warning message does not help avoiding problems.
>
> Let's change the direction to that developers had better use their local WQs
> if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable.
>
> Three patterns of problems are:
>
> (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock
> (circular locking dependency) problem.
>
> (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock
> (circular locking dependency) problem.
>
> (c) Even if there is no possibility of deadlock, flushing with specific locks
> held can cause khungtaskd to complain.
>
> An example of (a):
>
> ext4 filesystem called flush_scheduled_work(), which meant to wait for only
> work item scheduled by ext4 filesystem, tried to also wait for work item
> scheduled by 9p filesystem.
> https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8
>
> Fixed by reverting the problematic patch.
>
> An example of (b):
>
> It is GFP_KERNEL context when module's __exit function is called. But whether
> flush_workqueue() is called from restricted context depends on what locks does
> the module's __exit function hold.
>
> If request_module() is called from some work function using one of system-wide WQs,
> and flush_workqueue() is called on that WQ from module's __exit function, the kernel
> might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called
> on system-wide WQs is the safer choice.
>
> Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage")
> is for drivers/input/mouse/psmouse-smbus.c .
>
> An example of (c):
>
> ath9k driver calls schedule_work() via request_firmware_nowait().
> https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833
>
> ath6kl driver calls flush_scheduled_work() which needlessly waits for completion
> of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver).
> https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee
>
> Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git
> might be able to mitigate these problems. (Waiting for next merge window...)
Okay, from 1b3ce51dde365296:
"Flushing system-wide workqueues is dangerous and will be forbidden."
Thank you, this exactly explains the motivation which is what I was
after. I certainly agree there is a possibility for lock coupling via
the shared wq so that is fine by me.
>> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c
>> I am pretty sure can be removed. It is synchronized with the error capture side of
>> things which is not required for the test to work.
>>
>> I can send a patch for that or you can, as you prefer?
>
> OK. Please send a patch for that, for that patch will go linux-next.git tree via
> a tree for gpu/drm/i915 driver.
Patch sent. If I am right the easiest solution was just to remove the
flush. If I was wrong though I'll need to create a dedicated wq so we
will see what our automated CI will say.
Regards,
Tvrtko
More information about the Intel-gfx
mailing list