[PATCH 24/29] mm: vmscan: make global slab shrink lockless

Thu Jun 22 15:12:02 UTC 2023

On 6/22/23 10:53, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
> 
> 1) When the memory pressure is high and there are many
>    filesystems mounted or unmounted at the same time,
>    slab shrink will be affected (down_read_trylock()
>    failed).
> 
>    Such as the real workload mentioned by Kirill Tkhai:
> 
>    ```
>    One of the real workloads from my experience is start
>    of an overcommitted node containing many starting
>    containers after node crash (or many resuming containers
>    after reboot for kernel update). In these cases memory
>    pressure is huge, and the node goes round in long reclaim.
>    ```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>    in [1]) and a writer comes in (such as mount a fs),
>    then this writer will be blocked and cause all
>    subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
> 
> We used to implement the lockless slab shrink with
> SRCU [1], but then kernel test robot reported -88.8%
> regression in stress-ng.ramfs.ops_per_sec test case [2],
> so we reverted it [3].
> 
> This commit uses the refcount+RCU method [4] proposed by
> by Dave Chinner to re-implement the lockless global slab
> shrink. The memcg slab shrink is handled in the subsequent
> patch.
> 
> Currently, the shrinker instances can be divided into
> the following three types:
> 
> a) global shrinker instance statically defined in the kernel,
> such as workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel
> modules, such as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed.
> For case b, the memory of shrinker instance will be freed
> after the module is unloaded. But we will call synchronize_rcu()
> in free_module() to wait for RCU read-side critical section to
> exit. For case c, the memory of shrinker instance will be
> dynamically freed by calling kfree_rcu(). So we can use
> rcu_read_{lock,unlock}() to ensure that the shrinker instance
> is valid.
> 
> The shrinker::refcount mechanism ensures that the shrinker
> instance will not be run again after unregistration. So the
> structure that records the pointer of shrinker instance can be
> safely freed without waiting for the RCU read-side critical
> section.
> 
> In this way, while we implement the lockless slab shrink, we
> don't need to be blocked in unregister_shrinker() to wait
> RCU read-side critical section.
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>  ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>  ramfs:
>           1 System Management Interrupt
>  for a 60.03s run time:
>     5762.40s available CPU time
>        7.71s user time   (  0.13%)
>      226.93s system time (  3.94%)
>      234.64s total time  (  4.07%)
>  load average: 8.54 3.06 2.11
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.03s (1 min, 0.03 secs)
> 
> 2) After applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>  ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>  ramfs:
>           4 System Management Interrupts
>  for a 60.12s run time:
>     5771.95s available CPU time
>        7.44s user time   (  0.13%)
>      230.22s system time (  3.99%)
>      237.66s total time  (  4.12%)
>  load average: 8.18 2.43 0.84
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.12s (1 min, 0.12 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> ---
>  include/linux/shrinker.h |  6 ++++++
>  mm/vmscan.c              | 33 ++++++++++++++-------------------
>  2 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 7bfeb2f25246..b0c6c2df9db8 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -74,6 +74,7 @@ struct shrinker {
>  
>  	refcount_t refcount;
>  	struct completion completion_wait;
> +	struct rcu_head rcu;
>  
>  	void *private_data;
>  
> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>  void shrinker_free(struct shrinker *shrinker);
>  void unregister_and_free_shrinker(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
>  static inline void shrinker_put(struct shrinker *shrinker)
>  {
>  	if (refcount_dec_and_test(&shrinker->refcount))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6f9c4750effa..767569698946 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/rculist.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>  	down_write(&shrinker_rwsem);
>  	refcount_set(&shrinker->refcount, 1);
>  	init_completion(&shrinker->completion_wait);
> -	list_add_tail(&shrinker->list, &shrinker_list);
> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
>  	up_write(&shrinker_rwsem);
> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	wait_for_completion(&shrinker->completion_wait);
>  
>  	down_write(&shrinker_rwsem);
> -	list_del(&shrinker->list);
> +	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>  void unregister_and_free_shrinker(struct shrinker *shrinker)
>  {
>  	unregister_shrinker(shrinker);
> -	kfree(shrinker);
> +	kfree_rcu(shrinker, rcu);
>  }
>  EXPORT_SYMBOL(unregister_and_free_shrinker);
>  
> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>  
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +		rcu_read_unlock();

I don't think you can do this unlock?

> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> -		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> -		 */
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
> -	}
>  
> -	up_read(&shrinker_rwsem);
> -out:
> +		rcu_read_lock();

That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.

IIUC this is why Dave in [4] suggests unifying shrink_slab() with
shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

> +		shrinker_put(shrinker);
> +	}
> +	rcu_read_unlock();
>  	cond_resched();
>  	return freed;
>  }