[Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue?

Tetsuo Handa penguin-kernel at I-love.SAKURA.ne.jp
Thu Jun 30 11:19:34 UTC 2022


On 2022/06/30 19:17, Tvrtko Ursulin wrote:
> Could you give more context on reasoning here please? What is the difference
> between using the system_wq and flushing it from a random context? Or concern
> is about flushing from specific contexts?

Excuse me, but I couldn't catch what you meant. Let me therefore show three
patterns of problems, each with an example.

Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says:

  Tejun Heo commented that it makes no sense at all to call flush_workqueue()
  on the shared WQs as the caller has no idea what it's gonna end up waiting for.

  The "Think twice before calling this function! It's very easy to get into trouble
  if you don't take great care." warning message does not help avoiding problems.

  Let's change the direction to that developers had better use their local WQs
  if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable.
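
For illustration of the suggested direction (this is not i915 code; all "foo_*"
names below are made up), a driver allocates its own WQ, queues its work items
there, and flushes/destroys only that WQ on the teardown path:

  #include <linux/module.h>
  #include <linux/workqueue.h>

  static struct workqueue_struct *foo_wq;

  static void foo_work_fn(struct work_struct *work)
  {
          /* ... driver-specific processing ... */
  }
  static DECLARE_WORK(foo_work, foo_work_fn);

  static void foo_do_something(void)
  {
          /* Instead of schedule_work(&foo_work); */
          queue_work(foo_wq, &foo_work);
  }

  static int __init foo_init(void)
  {
          /* A driver-local WQ instead of one of the system-wide WQs. */
          foo_wq = alloc_workqueue("foo_wq", 0, 0);
          if (!foo_wq)
                  return -ENOMEM;
          return 0;
  }

  static void __exit foo_exit(void)
  {
          /*
           * This waits only for this driver's own work items, unlike
           * flush_scheduled_work()/flush_workqueue(system_wq).
           */
          flush_workqueue(foo_wq);
          destroy_workqueue(foo_wq);
  }

  module_init(foo_init);
  module_exit(foo_exit);
  MODULE_LICENSE("GPL");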

The three patterns of problems are:

  (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause a
      deadlock (circular locking dependency) problem.

  (b) Flushing with specific locks (e.g. module_mutex) held can cause a
      deadlock (circular locking dependency) problem.

  (c) Even if there is no possibility of deadlock, flushing with specific locks
      held can cause khungtaskd to complain.

An example of (a):

  The ext4 filesystem called flush_scheduled_work(), which was meant to wait
  only for work items scheduled by the ext4 filesystem, but ended up also
  waiting for work items scheduled by the 9p filesystem.
  https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8

  Fixed by reverting the problematic patch.
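
  The shape of that problem, reduced to a minimal sketch (hypothetical names,
  not the actual ext4/9p code): two unrelated users share system_wq, so one
  user's flush also waits for the other's work items:

    #include <linux/workqueue.h>

    /* Subsystem A queues its own work item on system_wq. */
    static void a_work_fn(struct work_struct *work) { /* ... */ }
    static DECLARE_WORK(a_work, a_work_fn);

    /* Subsystem B also uses system_wq and may block for a long time,
     * or depend on locks/reclaim progress of its own. */
    static void b_work_fn(struct work_struct *work) { /* ... */ }
    static DECLARE_WORK(b_work, b_work_fn);

    static void a_teardown(void)
    {
            schedule_work(&a_work);
            /*
             * Meant to wait only for a_work, but this is effectively
             * flush_workqueue(system_wq) and therefore also waits for
             * b_work, creating a dependency subsystem A never asked for.
             */
            flush_scheduled_work();
    }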

An example of (b):

  A module's __exit function is called in GFP_KERNEL context. But whether
  flush_workqueue() ends up being called from a restricted context depends on
  which locks the module's __exit function holds.

  If request_module() is called from some work function using one of the
  system-wide WQs, and flush_workqueue() is called on that WQ from a module's
  __exit function, the kernel might deadlock on the module_mutex lock. Making
  sure that flush_workqueue() is never called on system-wide WQs is the safer
  choice.

  Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage")
  is for drivers/input/mouse/psmouse-smbus.c .
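
  A minimal sketch of that shape (hypothetical "foo_*" names, not the actual
  psmouse-smbus code): a work item on system_wq loads a module, while the
  module's __exit path flushes the whole system_wq:

    #include <linux/kmod.h>
    #include <linux/module.h>
    #include <linux/workqueue.h>

    static void foo_load_helper_fn(struct work_struct *work)
    {
            /* The module loading this triggers takes module_mutex. */
            request_module("foo-helper");
    }
    static DECLARE_WORK(foo_load_helper, foo_load_helper_fn);

    static void __exit foo_exit(void)
    {
            /*
             * Module unload can contend on module_mutex; waiting here for
             * every work item on system_wq (including foo_load_helper, or
             * any other module's request_module() user) risks the deadlock
             * described above.  Flushing a driver-local WQ, or only the
             * specific work item, avoids waiting for unrelated users.
             */
            flush_workqueue(system_wq);
    }
    module_exit(foo_exit);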

An example of (c):

  The ath9k driver calls schedule_work() via request_firmware_nowait().
  https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833

  The ath6kl driver calls flush_scheduled_work(), which needlessly waits for
  completion of work items scheduled by the ath9k driver (i.e. loading the
  firmware used by the ath9k driver).
  https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee

  Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git
  might be able to mitigate these problems. (Waiting for next merge window...)
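
  The usual conversion looks roughly like this (a sketch with made-up names,
  not the actual ath6kl diff): wait only for the driver's own work item
  instead of for everything on system_wq:

    #include <linux/workqueue.h>

    struct foo_priv {
            struct work_struct fw_work;     /* hypothetical per-device work */
    };

    static void foo_stop(struct foo_priv *priv)
    {
            /*
             * Was: flush_scheduled_work(), which waited for every work item
             * on system_wq, including other drivers' firmware loads.
             * Now: wait only for this driver's own work item; use
             * cancel_work_sync() instead if it must not run afterwards.
             */
            flush_work(&priv->fw_work);
    }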

> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c
> I am pretty sure can be removed. It is synchronized with the error capture side of
> things which is not required for the test to work.
> 
> I can send a patch for that or you can, as you prefer?

OK. Please send a patch for that, since that patch will go to the
linux-next.git tree via the tree for the gpu/drm/i915 driver.


