[Intel-gfx] [PATCH v2] drm/i915: Remove __GFP_NORETRY from our buffer allocator

Michal Hocko mhocko at suse.com
Tue Jun 6 09:04:45 UTC 2017


On Mon 05-06-17 14:39:17, Chris Wilson wrote:
> Quoting Michal Hocko (2017-06-05 14:08:10)
> > On Mon 05-06-17 13:49:38, Chris Wilson wrote:
> > > Quoting Michal Hocko (2017-06-05 13:26:30)
> > > > On Mon 05-06-17 11:35:12, Chris Wilson wrote:
> > > > > I tried __GFP_NORETRY in the belief that __GFP_RECLAIM was effective. It
> > > > > struggles with handling reclaim via kswapd (through inconsistency within
> > > > > throttle_direct_reclaim(), and even then the race between multiple
> > > > > allocators makes the two-step reclaim-then-allocate approach fragile),
> > > > > and as our buffers are always dirty (with very few exceptions), we
> > > > > require kswapd to perform pageout on them. The only effective means of
> > > > > waiting on kswapd is to retry the allocations (i.e. not set
> > > > > __GFP_NORETRY). That leaves us with the dilemma of invoking the
> > > > > oomkiller instead of propagating the allocation failure back to
> > > > > userspace, where it can be handled more gracefully (one hopes). In the
> > > > > future we may have __GFP_MAYFAIL to allow repeats up until we genuinely
> > > > > run out of memory and the oomkiller would have been invoked. Until
> > > > > then, let the oomkiller wreak havoc.
> > > > > 
> > > > > v2: Stop playing with side-effects of gfp flags and await __GFP_MAYFAIL
> > > > > 
> > > > > Fixes: 24f8e00a8a2e ("drm/i915: Prefer to report ENOMEM rather than incur the oom for gfx allocations")
> > > > > Testcase: igt/gem_tiled_swapping
> > > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > > > > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > > > > Cc: Michal Hocko <mhocko at suse.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_gem.c | 15 ++++++++++++++-
> > > > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > > index 7286f5dd3e64..845df6067e90 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > > @@ -2406,7 +2406,20 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
> > > > >                       if (!*s) {
> > > > >                               /* reclaim and warn, but no oom */
> > > > >                               gfp = mapping_gfp_mask(mapping);
> > > > > -                             gfp |= __GFP_NORETRY;
> > > > > +
> > > > > +                             /* Our bo are always dirty and so we require
> > > > > +                              * kswapd to reclaim our pages (direct reclaim
> > > > > +                              * performs no swapping on its own). However,
> > > > 
> > > > Not sure whether this is exactly what you mean. The only pageout that
> > > > direct reclaim is allowed to do is to the swap partition (so anonymous
> > > > and shmem pages). So the above is not 100% correct.
> > > 
> > > Hmm, I didn't see anything that allows direct reclaim to perform
> > > writeback into swap.
> 
> Hmm, maybe I was looking too hard at reclaim_clean_pages_from_list(), or
> at least I was looking for ways I could start pageout earlier.
>  
> > shrink_page_list
> >   add_to_swap
> >     add_to_swap_cache (gains mapping see page_mapping)
> >   pageout
> >     mapping->a_ops->writepage = shmem_writepage
> >       swap_writepage
> 
> Right, if we have sc->may_writepage it should hit here the first
> time it sees a dirty page on a reclaim list. Hmm, I interpreted the
> "only if nonblocking" to mean that we wouldn't swap out under direct
> reclaim. Perhaps a better question then is: why doesn't shmemfs make
> better progress with writeback under direct reclaim?

I assume that is because our reclaim is heavily pagecache biased and we
try not to swap as much as possible (see get_scan_count). So it is quite
possible that we do not even age the anonymous LRU lists, because there
is a sufficient amount of page cache to reclaim.

> > note that regular writeback (of file-backed pages) is not allowed by
> > shrink_page_list:
> >                         if (page_is_file_cache(page) &&
> >                             (!current_is_kswapd() || !PageReclaim(page) ||
> >                              !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
> >                                 /*
> >                                  * Immediately reclaim when written back.
> >                                  * Similar in principal to deactivate_page()
> >                                  * except we already have the page isolated
> >                                  * and know it's dirty
> >                                  */
> >                                 inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
> >                                 SetPageReclaim(page);
> > 
> >                                 goto activate_locked;
> >                         }
> > called right before pageout()
> 
> At one point I thought that was just there to make my life more
> miserable. But its purpose is clear: no need to write a clean copy of a
> file to swap when it can be read back in from the file later.

Well, the primary reason for this code to exist is the terrible IO
patterns when submitting scattered data, and also kernel stack
exhaustion: the page allocator is called from quite deep call chains,
and any more complex storage stack is likely to consume a lot of stack
on top of that. Swap is special and much simpler in many ways.

> > > The issue for us (i915) is that our buffers are
> > > almost exclusively dirty, so even after we unpin them, in order to make
> > > room they need to be paged out. Afaict, throttle_direct_reclaim() is
> > > supposed to be the point at which direct reclaim waits for writeback via
> > > kswapd and doesn't invoke writeback directly.
> > 
> > Well, throttle_direct_reclaim is more about preventing over-reclaim
> > than anything else. It doesn't check the amount of dirty data and such.
> 
> It's the only mechanism I could see there that tried to wait on
> kswapd. __GFP_KSWAPD_RECLAIM just means "kick kswapd", not "actually
> wait for progress from kswapd". I felt misled by that flag.

Well, we do not really wait for kswapd for __GFP_NORETRY. We just try to
reclaim, consume what we have reclaimed, and that is normally OK. If
there is heavy and continuous memory pressure then we are out of luck,
but the caller told us not to retry very hard, so this should be
acceptable behavior. Standard GFP_KERNEL allocations keep retrying the
reclaim, which allows kswapd and the flushers to do their job.

-- 
Michal Hocko
SUSE Labs

