[PATCH 2/2] Push active reclaim into its own kthread

Janusz Krzysztofik janusz.krzysztofik at linux.intel.com
Tue Jul 19 14:16:43 UTC 2022


From: Chris Wilson <chris.p.wilson at intel.com>

kswapd also includes fs_reclaim annotations which prohibit us from
waking the device to perform memory reclaim. In order to recover system
memory claimed by i915 while it is idle, spawn a background thread woken
by kswapd, to perform active reclaim and unbind pages.

We rarely see this warning: during swap testing it is generally an active
device hogging most of the memory, so the device infrequently enters
runtime suspend. When it does, we see a lockdep splat on the runtime
resume:

<4> [437.542859] CPU: 0 PID: 93 Comm: kswapd0 Tainted: G U 5.19.0-rc6-CI_DRM_11876-g2305e0d00665+ #1
<4> [437.542866] Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-S ADP-S DDR4 UDIMM CRB, BIOS ADLSFWI1.R00.2374.A00.2109091630 09/09/2021
<4> [437.542875] Call Trace:
<4> [437.542877] <TASK>
<4> [437.542880] dump_stack_lvl+0x56/0x7b
<4> [437.542885] check_noncircular+0x12e/0x150
<4> [437.542890] ? stack_trace_save+0x46/0x70
<4> [437.542895] ? save_trace+0x3d/0x380
<4> [437.542900] __lock_acquire+0x15ad/0x2940
<4> [437.542905] lock_acquire+0xd3/0x310
<4> [437.542909] ? acpi_device_wakeup_disable+0x12/0x50
<4> [437.542915] __mutex_lock+0x97/0xf20
<4> [437.542919] ? acpi_device_wakeup_disable+0x12/0x50
<4> [437.542923] ? acpi_device_wakeup_disable+0x12/0x50
<4> [437.542927] ? mark_held_locks+0x48/0x70
<4> [437.542933] ? acpi_device_wakeup_disable+0x12/0x50
<4> [437.542936] acpi_device_wakeup_disable+0x12/0x50
<4> [437.542941] acpi_pm_set_device_wakeup+0x6e/0x100
<4> [437.542945] __pci_enable_wake+0x73/0xa0
<4> [437.542951] pci_pm_runtime_resume+0x45/0x90
<4> [437.542955] ? pci_pm_default_resume+0x20/0x20
<4> [437.542959] __rpm_callback+0x3d/0x110
<4> [437.542964] ? pci_pm_default_resume+0x20/0x20
<4> [437.542968] rpm_callback+0x54/0x60
<4> [437.542972] ? pci_pm_default_resume+0x20/0x20
<4> [437.542976] rpm_resume+0x54f/0x750
<4> [437.542981] __pm_runtime_resume+0x42/0x80
<4> [437.542985] __intel_runtime_pm_get+0x19/0x80 [i915]
<4> [437.543068] i915_gem_object_unbind+0x8f/0x3b0 [i915]
<4> [437.543175] i915_gem_shrink+0x634/0x850 [i915]
<4> [437.543282] i915_gem_shrinker_scan+0x3a/0xc0 [i915]
<4> [437.543384] shrink_slab.constprop.97+0x1a4/0x4f0
<4> [437.543390] shrink_node+0x21e/0x420
<4> [437.543394] balance_pgdat+0x241/0x5c0
<4> [437.543400] kswapd+0x229/0x4f0
<4> [437.543404] ? finish_swait+0x80/0x80
<4> [437.543408] ? balance_pgdat+0x5c0/0x5c0
<4> [437.543412] kthread+0xed/0x120
<4> [437.543415] ? kthread_complete_and_exit+0x20/0x20
<4> [437.543420] ret_from_fork+0x1f/0x30

Using a background kthread for active reclaim depends on the same
observation as for kswapd: because it is decoupled from direct reclaim,
it does not need to inherit the reclaim context. This decoupling works
because page reclamation does not wait for kswapd; it merely polls for
successful page allocation. Prolonged failure by kswapd leads to oom.

The background shrinker will only be run when there is memory pressure,
so any allocations it performs will only add to the resource contention.
The shrinker itself does not require allocations, but we may need some
to perform runtime resume (ACPI being the common example). During those
allocations, we will first try direct reclaim with the i915 shrinker
as normal.

Mark up the background shrinker with memalloc and fs_reclaim. The former
denotes the kthread's purpose and allows us to dip into memory reserves;
the latter ensures that our shrinker remains consistent with the rules
for direct reclaim.

In addition to the issue of trying to do runtime-pm from inside
kswapd, Tvrtko has found that on CrOS the shrinker callbacks sit inside
the UX latency-critical path. Each fork in CrOS is inside a new
namespace, which instantiates a new shrinker, and that blocks waiting
for shrink_slab. Ultimately the UI process is blocked waiting for a
forked process, which is in turn waiting upon kswapd running the i915
shrinker. We moved the waits for active reclaim from direct reclaim to
kswapd, believing that would take them off latency-sensitive paths for
the rest of the system. That has been demonstrated to be false, and so
we need to move active reclaim outside of the shrinker entirely to
avoid blocking the UI.

Furthermore, Thomas has spotted a potential deadlock between kswapd and
suspend, where kswapd might be sleeping on a request while holding a
mutex that we require during i915_gem_suspend. Since kswapd is frozen
before the device is suspended, even when we forcibly reset the device
and cancel the requests, kswapd will not be woken to release the mutex.
This issue is not addressed by this patch; the swapper kthread should
have identical [buggy] characteristics to kswapd, and we will need to
implement a freezer to break the deadlock before suspending the device.

Reported-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
Reported-by: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6449
Testcase: igt/i915_pm_rpm/gem-swap
Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
Cc: Matthew Auld <matthew.auld at intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
Signed-off-by: Chris Wilson <chris.p.wilson at intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c | 111 ++++++++++++++++---
 drivers/gpu/drm/i915/i915_drv.h              |  15 +++
 2 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
index 1030053571a2..6e1d02881c0c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
@@ -310,6 +310,92 @@ i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 	return count;
 }
 
+static int swapper(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	atomic_long_t *target = &i915->mm.swapper.target;
+	unsigned int noreclaim_state;
+
+	/*
+	 * For us to be running the swapper implies that the system is under
+	 * enough memory pressure to be swapping. At that point, we both want
+	 * to ensure we make forward progress in order to reclaim pages from
+	 * the device and not contribute further to direct reclaim pressure. We
+	 * mark ourselves as a memalloc task in order to not trigger direct
+	 * reclaim ourselves, but dip into the system memory reserves for
+	 * shrinkers.
+	 */
+	noreclaim_state = memalloc_noreclaim_save();
+
+	do {
+		intel_wakeref_t wakeref;
+
+		___wait_var_event(target,
+				  atomic_long_read(target) ||
+				  kthread_should_stop(),
+				  TASK_IDLE, 0, 0, schedule());
+		if (kthread_should_stop())
+			break;
+
+		with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
+			unsigned long nr_scan = atomic_long_xchg(target, 0);
+
+			/*
+			 * Now that we have woken up the device hierarchy,
+			 * act as a normal shrinker. Our shrinker is primarily
+			 * focussed on supporting direct reclaim (low latency,
+			 * avoiding contention that may lead to more reclaim,
+			 * or prevent that reclaim from making forward progress)
+			 * and we wish to continue that good practice even
+			 * here where we could accidentally sleep holding locks.
+			 *
+			 * Let lockdep know and warn us about any bad practice
+			 * that may lead to high latency in direct reclaim, or
+			 * anywhere else.
+			 *
+			 * While the swapper is active, direct reclaim from
+			 * other threads will also be running in parallel
+			 * through i915_gem_shrink(), scouring for idle pages.
+			 */
+			fs_reclaim_acquire(GFP_KERNEL);
+			i915_gem_shrink(NULL, i915,
+					nr_scan, &nr_scan,
+					I915_SHRINK_ACTIVE |
+					I915_SHRINK_BOUND |
+					I915_SHRINK_UNBOUND |
+					I915_SHRINK_WRITEBACK);
+			fs_reclaim_release(GFP_KERNEL);
+		}
+	} while (1);
+
+	memalloc_noreclaim_restore(noreclaim_state);
+	return 0;
+}
+
+static void start_swapper(struct drm_i915_private *i915)
+{
+	i915->mm.swapper.tsk = kthread_run(swapper, i915, "i915-swapd");
+	if (IS_ERR(i915->mm.swapper.tsk))
+		drm_err(&i915->drm,
+			"Failed to launch swapper; memory reclaim may be degraded\n");
+}
+
+static void kick_swapper(struct drm_i915_private *i915, long nr_scan)
+{
+	if (!atomic_long_fetch_add(nr_scan, &i915->mm.swapper.target))
+		wake_up_var(&i915->mm.swapper.target);
+}
+
+static void stop_swapper(struct drm_i915_private *i915)
+{
+	struct task_struct *tsk = fetch_and_zero(&i915->mm.swapper.tsk);
+
+	if (IS_ERR_OR_NULL(tsk))
+		return;
+
+	kthread_stop(tsk);
+}
+
 static unsigned long
 i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 {
@@ -318,27 +404,20 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 	unsigned long freed;
 
 	sc->nr_scanned = 0;
-
 	freed = i915_gem_shrink(NULL, i915,
 				sc->nr_to_scan,
 				&sc->nr_scanned,
 				I915_SHRINK_BOUND |
 				I915_SHRINK_UNBOUND);
-	if (sc->nr_scanned < sc->nr_to_scan && current_is_kswapd()) {
-		intel_wakeref_t wakeref;
+	if (!sc->nr_scanned) /* nothing left to reclaim */
+		return SHRINK_STOP;
 
-		with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
-			freed += i915_gem_shrink(NULL, i915,
-						 sc->nr_to_scan - sc->nr_scanned,
-						 &sc->nr_scanned,
-						 I915_SHRINK_ACTIVE |
-						 I915_SHRINK_BOUND |
-						 I915_SHRINK_UNBOUND |
-						 I915_SHRINK_WRITEBACK);
-		}
-	}
+	/* Pages still bound and system is failing with direct reclaim? */
+	if (sc->nr_scanned < sc->nr_to_scan && current_is_kswapd())
+		/* Defer high latency tasks to a background thread. */
+		kick_swapper(i915, sc->nr_to_scan - sc->nr_scanned);
 
-	return sc->nr_scanned ? freed : SHRINK_STOP;
+	return freed;
 }
 
 static int
@@ -434,10 +513,14 @@ void i915_gem_driver_register__shrinker(struct drm_i915_private *i915)
 	i915->mm.vmap_notifier.notifier_call = i915_gem_shrinker_vmap;
 	drm_WARN_ON(&i915->drm,
 		    register_vmap_purge_notifier(&i915->mm.vmap_notifier));
+
+	start_swapper(i915);
 }
 
 void i915_gem_driver_unregister__shrinker(struct drm_i915_private *i915)
 {
+	stop_swapper(i915);
+
 	drm_WARN_ON(&i915->drm,
 		    unregister_vmap_purge_notifier(&i915->mm.vmap_notifier));
 	drm_WARN_ON(&i915->drm,
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3364a6e5169b..976983ab67a9 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -283,6 +283,21 @@ struct i915_gem_mm {
 	/* shrinker accounting, also useful for userland debugging */
 	u64 shrink_memory;
 	u32 shrink_count;
+
+	/* background task for returning bound system pages */
+	struct {
+		struct task_struct *tsk; /* our kswapd equivalent */
+
+		/*
+		 * Track the number of pages do_shrink_slab() has asked us
+		 * to reclaim, and we have failed to find. This count of
+		 * outstanding reclaims is passed to the swapper thread,
+		 * which then blocks as it tries to achieve that goal.
+		 * It is likely that the target overshoots due to the
+		 * latency between our thread and kswapd making new requests.
+		 */
+		atomic_long_t target;
+	} swapper;
 };
 
 #define I915_IDLE_ENGINES_TIMEOUT (200) /* in ms */
-- 
2.25.1
