[Intel-gfx] [PATCH v2] drm/i915: Remove __GFP_NORETRY from our buffer allocator

Mon Jun 5 11:51:19 UTC 2017

Quoting Daniel Stone (2017-06-05 11:47:44)
> Hi,
> 
> On 5 June 2017 at 11:35, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > I tried __GFP_NORETRY in the belief that __GFP_RECLAIM was effective. It
> > struggles with handling reclaim via kswapd (through inconsistency within
> > throttle_direct_reclaim() and even then the race between multiple
> > allocators makes the two step of reclaim then allocate fragile), and as
> > our buffers are always dirty (with very few exceptions), we required
> > kswapd to perform pageout on them. The only effective means of waiting
> > on kswapd is to retry the allocations (i.e. not set __GFP_NORETRY). That
> > leaves us with the dilemma of invoking the oomkiller instead of
> > propagating the allocation failure back to userspace where it can be
> > handled more gracefully (one hopes).
> 
> The i965 GL(ES) driver may dash your hopes somewhat:
> 
>          ret = execbuffer(dri_screen->fd, batch, hw_ctx,
>                           4 * USED_BATCH(*batch),
>                           in_fence_fd, out_fence_fd, flags);
> 
>    if (ret != 0) {
>       fprintf(stderr, "intel_do_flush_locked failed: %s\n", strerror(-ret));
>       exit(1);
>    }

And their response has been that mesa is not a system critical library,
so don't use it in such roles. We have also floated patches for several
years now for that error to percolate back. Otoh, all the other memory
handling paths have more or less been fixed to report back that failure.
You'll note from Linus's message it wasn't this that failed, but one of
those other allocation paths and the error handling associated with them
in the client.

> Before removing NORETRY, occasionally I'd get lucky and Chrome would
> fail, but usually it'd be Mutter and my entire session would
> disappear. I'm also not sure what a good strategy as a compositor
> would be: just keep on trying updates until you get lucky? Sit doing
> nothing for a while and hope redraws succeed 'later' ... ?

In terms of mutter and chrome, the first question is why they have
filled all of memory with buffers. I strongly suspect there are leaks all
around, but if there are not, something as complex as chrome can easily
identify buffers it has kept merely for convenience that can be rebuilt
on demand. (But they should already be using GL APPLE purgeable memory
for those.) There is a wider problem than this: I have seen a growth in
the number of order-6 allocation failures over the past few kernels (or
rather distribution updates) for simple cursor updates (where the first
question to be asked is why is the cursor being reallocated?).
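
As a rough userspace illustration of "kept merely for convenience,
rebuilt on demand" (a sketch only; struct cached_buf and its helpers are
hypothetical names, not anything from chrome, mutter or the GL
extension), a cache entry can carry its own rebuild callback so that its
backing storage can be dropped under pressure:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry: contents can always be regenerated. */
struct cached_buf {
    void *data;                               /* NULL once purged */
    size_t size;                              /* bytes to allocate on rebuild */
    void (*rebuild)(void *dst, size_t size);  /* regenerates the contents */
};

/* Drop the backing storage; nothing is lost, only the rebuild cost is paid later. */
static void cached_buf_purge(struct cached_buf *b)
{
    free(b->data);
    b->data = NULL;
}

/* Return the contents, rebuilding them if they were purged.
 * Returns NULL if the allocation fails, so the caller can back off. */
static void *cached_buf_get(struct cached_buf *b)
{
    assert(b->rebuild != NULL);
    if (!b->data) {
        b->data = malloc(b->size);
        if (!b->data)
            return NULL;
        b->rebuild(b->data, b->size);
    }
    return b->data;
}

/* Example rebuild callback: fills the buffer with a fixed pattern. */
static void fill_pattern(void *dst, size_t size)
{
    memset(dst, 0xAB, size);
}
```

A caller would purge such entries first when an allocation fails, then
retry, and only report failure upwards if that did not help.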

And yes, they should be robust in handling error conditions and throw
away rendering if it failed. A compositor may even preallocate resources
so that it can throw a message onto a screen if a failure occurs; a
low-level driver can do that, and with Vulkan that too should be possible
(at least down to a few mallocs internal to the driver, but using an
externally controlled allocator).
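
To make "throw away rendering if it failed" concrete, here is a minimal
sketch, assuming a submission hook that returns 0 or a negative errno in
the style of the execbuffer() call quoted above. The names submit_fn,
submit_frame and the two stubs are hypothetical and exist only for
illustration; this is not the mesa code path:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical submission hook: returns 0 on success or -errno on failure. */
typedef int (*submit_fn)(void *batch);

enum frame_result { FRAME_PRESENTED, FRAME_DROPPED, FRAME_FATAL };

/* Submit one frame. On -ENOMEM the frame is dropped and the caller keeps
 * showing the previous contents; other errors are reported as fatal to the
 * caller instead of calling exit() from inside the library. */
static enum frame_result submit_frame(submit_fn submit, void *batch)
{
    int ret;

    assert(submit != NULL);
    ret = submit(batch);
    if (ret == 0)
        return FRAME_PRESENTED;

    if (ret == -ENOMEM) {
        fprintf(stderr, "frame dropped: %s\n", strerror(-ret));
        return FRAME_DROPPED;
    }

    fprintf(stderr, "submit failed: %s\n", strerror(-ret));
    return FRAME_FATAL;
}

/* Stubs standing in for a real submission path. */
static int submit_ok(void *batch) { (void)batch; return 0; }
static int submit_enomem(void *batch) { (void)batch; return -ENOMEM; }
static int submit_eio(void *batch) { (void)batch; return -EIO; }
```

What to do on FRAME_DROPPED (retry on the next redraw, or fall back to a
preallocated error surface) remains the compositor's policy decision.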

> Similarly
> to Linus, I was in a position where reclaim should've been extremely
> effective - into the gigabytes - at the time, so pushing reclaim
> harder and taking a small time penalty seems far better than a hard
> failure.

The point here was that we did reclaim and still failed, after having
first purged everything unwanted from the set of i915 buffers (the
failure outlined in the changelog was that reclaim is not waiting for
kswapd due to a false positive from allow_direct_reclaim()). It wasn't
that we asked for no reclaim, it's just that we don't want to have the
oomkiller kill something at random. And there was no middle ground in
the set of gfp flags.
-Chris