[Intel-gfx] i915 memory allocation failure..

Mon Jun 5 09:04:35 UTC 2017

Quoting Linus Torvalds (2017-06-05 05:26:06)
> So there's something wrong with the memory allocation changes since
> 4.11, which seem to be mostly credited to Chris Wilson.
> 
> In particular, I got this earlier today:
> 
>   Xorg: page allocation failure: order:0,
> mode:0x14210d2(GFP_HIGHUSER|__GFP_NORETRY|__GFP_RECLAIMABLE),
> nodemask=(null)
> 
> and then soon afterwards in the log I see
> 
>   chrome[13102]: segfault at 968 ip 00007f472a7fda83 sp
> 00007fffab9a6ef0 error 4 in libX11.so.6.3.0[7f472a7d1000+138000]
>   gnome-session-f[13115]: segfault at 0 ip 00007f7e765ab4b9 sp
> 00007ffca5990470 error 4 in libgtk-3.so.0.2200.15[7f7e762cc000+6f9000]
> 
> which I assume is related to broken error handling.
> 
> So there's at least two bugs here:
> 
>  (a) order-0 memory allocation failure is generally a sign of
> something bad. We clearly give up *much* too easily.

Ever since we sent that patch to you and it saw wider testing, I've been
trying to get it reverted.

    drm/i915: Remove __GFP_NORETRY from our buffer allocator

    I tried __GFP_NORETRY in the belief that __GFP_RECLAIM was effective. It
    struggles with handling reclaim via kswapd (through inconsistency within
    throttle_direct_reclaim() and even then the race between multiple
    allocators makes the two step of reclaim then allocate fragile), and as
    our buffers are always dirty (with very few exceptions), we required
    kswapd to perform pageout on them. The only effective means of waiting
    on kswapd is to retry the allocations (i.e. not set __GFP_NORETRY). That
    leaves us with the dilemma of invoking the oomkiller instead of
    propagating the allocation failure back to userspace where it can be
    handled more gracefully (one hopes). 

The situation where this is most likely to happen though is where i915
has consumed all available memory. That's not a good sign.
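
To make the fragility described in that commit message concrete, here is
a minimal user-space sketch of the "try without retrying, reclaim, then
try again" pattern. It is a toy model, not the i915 code: struct toy_pool,
toy_alloc_noretry(), toy_shrink(), toy_get_pages() and the "stolen"
parameter are all hypothetical names made up for illustration, and the
competing allocator is simulated by simply removing pages between the two
steps.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; every name below is hypothetical, none is a kernel API. */
struct toy_pool {
	size_t free_pages;      /* pages that can be handed out right now */
	size_t purgeable_pages; /* driver-owned pages a shrinker could release */
};

/* Stand-in for a NORETRY-style attempt: succeed only if pages are free now. */
bool toy_alloc_noretry(struct toy_pool *p, size_t n)
{
	if (p->free_pages < n)
		return false;
	p->free_pages -= n;
	return true;
}

/* Stand-in for the driver shrinking its own caches; returns pages released. */
size_t toy_shrink(struct toy_pool *p, size_t target)
{
	size_t got = target < p->purgeable_pages ? target : p->purgeable_pages;

	p->purgeable_pages -= got;
	p->free_pages += got;
	return got;
}

/*
 * Two-step allocation: fast attempt, reclaim, second attempt.
 * "stolen" simulates a competing allocator taking pages in the window
 * between reclaim and the second attempt.
 * Returns 0 on success, -ENOMEM if both attempts fail.
 */
int toy_get_pages(struct toy_pool *p, size_t n, size_t stolen)
{
	if (toy_alloc_noretry(p, n))
		return 0;

	/* Reclaim just enough to cover the shortfall. */
	toy_shrink(p, n - p->free_pages);

	/* Another allocator may win the race for the pages we just freed. */
	p->free_pages -= stolen < p->free_pages ? stolen : p->free_pages;

	if (toy_alloc_noretry(p, n))
		return 0;

	return -ENOMEM; /* reported to the caller instead of retrying */
}
```

With stolen == 0 the second attempt succeeds; with any competition in
that window the caller gets -ENOMEM even though enough pages were
reclaimed, which is the failure mode described above.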

>  (b) using __GFP_NORETRY and wanting the memory failure, but then not
> using __GFP_NOWARN is just stupid.
> 
> Now, (b) initially made me go "I'll just add that __GFP_NOWARN
> myself". Because it's true - if you intentionally tell the VM
> subsystem that you'd rather get a failed allocation than try a bit
> harder, then you obviously shouldn't get the warning either. I think
> the VM people have talked about just considering NORETRY to imply
> NOWARN.

We don't want a WARN here because the error is always sent back to
userspace, and userspace then has an opportunity to act upon that
-ENOMEM. The goal here was to return -ENOMEM rather than invoke the
oomkiller or, worse, the lowmemkiller.

> However, the fact that this actually caused problems in downstream
> user space, and the fact that this happened with an order-0 allocation
> made me re-consider. That allocation clearly *is* important, and
> returning NULL may "work" from a kernel standpoint, but it sure as
> hell didn't work in the bigger picture, now did it?

Right, everything is simply broken. Userspace runs out of memory and
dies. We obviously can't expect them to fix such bugs, or to explain how
we ended up with so much memory tied to gfx, so everyone should install
more memory.

> So the warning was actually good in this case. This may in fact be an
> example of why GFP_NORETRY should *not* imply NOWARN.

Was it? It didn't tell you why userspace died, just that, some time
before, we ran out of memory for a gfx allocation. I would hope the
userspace logs gave a clearer picture of how it failed.

> So instead of shutting up the warning, I pass it over to the i915
> people. Making that allocation fail easily wasn't such a great idea
> after all, was it? Maybe that NORETRY should be reconsidered, at least
> for important (perhaps small?) allocations?

What's important? A framebuffer being prepped for scanout? That can be
256M, more typically around 8-32M at present.

> Also adding some VM people, because I think it's ridiculous that the
> 0-order allocation failed in the first place. Full report attached,
> there's tons of memory that should have been trivial to free.
> 
> So I suspect GFP_NORETRY ends up being *much* too aggressive, and
> basically doesn't even try any trivial freeing.
> 
> Maybe we want some middle ground between "retry forever" and "don't
> try at all". In people trying to fight the "retry forever", we seem to
> have gone too far in the "don't even bother, just return NULL"
> direction.

I wanted something like Michal's __GFP_MAYFAIL, just to give up before we
finally invoke the oomkiller.
-Chris