[Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue?

Thu Jun 30 13:09:02 UTC 2022

On 30/06/2022 12:19, Tetsuo Handa wrote:
> On 2022/06/30 19:17, Tvrtko Ursulin wrote:
>> Could you give more context on reasoning here please? What is the difference
>> between using the system_wq and flushing it from a random context? Or concern
>> is about flushing from specific contexts?
> 
> Excuse me, but I couldn't catch what you want. Thus, I show three patterns of
> problems with an example.
> 
> Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says:
> 
>    Tejun Heo commented that it makes no sense at all to call flush_workqueue()
>    on the shared WQs as the caller has no idea what it's gonna end up waiting for.
> 
>    The "Think twice before calling this function! It's very easy to get into trouble
>    if you don't take great care." warning message does not help avoiding problems.
> 
>    Let's change the direction to that developers had better use their local WQs
>    if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable.
> 
> Three patterns of problems are:
> 
>    (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock
>        (circular locking dependency) problem.
> 
>    (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock
>        (circular locking dependency) problem.
> 
>    (c) Even if there is no possibility of deadlock, flushing with specific locks
>        held can cause khungtaskd to complain.
> 
> An example of (a):
> 
>    ext4 filesystem called flush_scheduled_work(), which meant to wait for only
>    work item scheduled by ext4 filesystem, tried to also wait for work item
>    scheduled by 9p filesystem.
>    https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8
> 
>    Fixed by reverting the problematic patch.
> 
> An example of (b):
> 
>    It is GFP_KERNEL context when module's __exit function is called. But whether
>    flush_workqueue() is called from restricted context depends on what locks does
>    the module's __exit function hold.
> 
>    If request_module() is called from some work function using one of system-wide WQs,
>    and flush_workqueue() is called on that WQ from module's __exit function, the kernel
>    might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called
>    on system-wide WQs is the safer choice.
> 
>    Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage")
>    is for drivers/input/mouse/psmouse-smbus.c .
> 
> An example of (c):
> 
>    ath9k driver calls schedule_work() via request_firmware_nowait().
>    https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833
> 
>    ath6kl driver calls flush_scheduled_work() which needlessly waits for completion
>    of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver).
>    https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee
> 
>    Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git
>    might be able to mitigate these problems. (Waiting for next merge window...)

Okay, from 1b3ce51dde365296:

  "Flushing system-wide workqueues is dangerous and will be forbidden."

Thank you, this exactly explains the motivation which is what I was 
after. I certainly agree there is a possibility for lock coupling via 
the shared wq so that is fine by me.

>> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c
>> I am pretty sure can be removed. It is synchronized with the error capture side of
>> things which is not required for the test to work.
>>
>> I can send a patch for that or you can, as you prefer?
> 
> OK. Please send a patch for that, for that patch will go linux-next.git tree via
> a tree for gpu/drm/i915 driver.

Patch sent. If I am right the easiest solution was just to remove the 
flush. If I was wrong though I'll need to create a dedicated wq so we 
will see what our automated CI will say.

Regards,

Tvrtko