[PATCH] drm/ttm: stop warning on TT shrinker failure
Daniel Vetter
daniel at ffwll.ch
Tue Mar 23 13:15:05 UTC 2021
On Tue, Mar 23, 2021 at 01:04:03PM +0100, Michal Hocko wrote:
> On Tue 23-03-21 12:48:58, Christian König wrote:
> > On 23.03.21 at 12:28, Daniel Vetter wrote:
> > > On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
> > > > I think this is where I still don't get what Christian is trying to do: We
> > > > really shouldn't use different tricks and calling contexts between direct
> > > > reclaim and kswapd reclaim. Otherwise very hard-to-track-down bugs are
> > > > pretty much guaranteed. So whether we use explicit gfp flags or the
> > > > context APIs, the result is exactly the same.
> >
> > Ok, let us recap what TTM's TT shrinker does here:
> >
> > 1. We have memory which is not swappable because it might be accessed by the
> > GPU at any time.
> > 2. Make sure the memory is not accessed by the GPU, and that drivers need to
> > grab a lock before they can make it accessible again.
> > 3. Allocate a shmem file and copy over the non-swappable pages.
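(For anyone following along: step 3 above is conceptually something like the
sketch below. This is only an illustration, not the actual patch; num_pages,
tt->pages and the error unwind are hand-waved here.)

        struct file *swap_storage;
        struct address_space *mapping;
        struct page *from_page, *to_page;
        unsigned long i;

        /* back the copy with a shmem object, so the pages become swappable */
        swap_storage = shmem_file_setup("ttm swap", num_pages << PAGE_SHIFT, 0);
        if (IS_ERR(swap_storage))
                return PTR_ERR(swap_storage);

        mapping = swap_storage->f_mapping;

        /* copy every GPU-pinned page into the shmem object */
        for (i = 0; i < num_pages; ++i) {
                from_page = tt->pages[i];
                to_page = shmem_read_mapping_page(mapping, i);
                if (IS_ERR(to_page))
                        goto out_err;
                copy_highpage(to_page, from_page);
                set_page_dirty(to_page);
                mark_page_accessed(to_page);
                put_page(to_page);
        }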
>
> This is quite tricky because the shrinker operates in the PF_MEMALLOC
> context so such an allocation would be allowed to completely deplete
> memory unless you explicitly mark that context as __GFP_NOMEMALLOC. Also
> note that if the allocation cannot succeed it will not trigger reclaim
> again because you are already called from the reclaim context.
[Limiting my reply to that part of the discussion]
Yes, it's not emulating real (direct) reclaim correctly, but in my experience
the biggest issue with direct reclaim is when you do mutex_lock instead of
mutex_trylock, or in general block on stuff that you can't block on. And
lockdep + fs_reclaim annotations catch exactly that, so they are pretty good
for making sure our shrinker is correct.
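Concretely, by "annotations" I mean wrapping the shrinker call in the
fs_reclaim lockdep context from a selftest, roughly like this (just a
sketch; ttm_tt_shrinker is a made-up name for illustration):

        struct shrink_control sc = {
                .gfp_mask = GFP_KERNEL,
                .nr_to_scan = SHRINK_BATCH,
        };

        /* tell lockdep we are (pretend-)in reclaim, so it complains about any
         * lock in the shrinker that is also held around memory allocations */
        fs_reclaim_acquire(GFP_KERNEL);
        ttm_tt_shrinker.scan_objects(&ttm_tt_shrinker, &sc);
        fs_reclaim_release(GFP_KERNEL);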
Actually tuning it and making sure it's not doing silly things is of course a
different matter, and for that we can't test it in isolation. But it's good
to know that before you tune it, you have rather high confidence that it's
at least correct. And for that, not running with PF_MEMALLOC is actually
good, since it means more allocation failures, and so more testing of those
error/backoff paths in the code.
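In the copy path that would look roughly like the snippet below (again just
a sketch; mapping, i and out_undo are the hand-waved bits from the earlier
sketch, and the exact gfp flag combination is something to settle in review):

        /* don't dip into the PF_MEMALLOC reserves, and don't warn when the
         * allocation fails - we have a backoff path for that */
        gfp_t gfp = GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN;

        to_page = shmem_read_mapping_page_gfp(mapping, i, gfp);
        if (IS_ERR(to_page)) {
                /* back off: undo the pages copied so far and report that
                 * nothing could be freed, instead of pushing on */
                ret = PTR_ERR(to_page);
                goto out_undo;
        }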
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch