[Bug 111512] [snb] GPU HANG in gnome-shell (after swap?)

Thu Aug 29 17:00:03 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=111512

--- Comment #7 from Chris Wilson <chris at chris-wilson.co.uk> ---
(In reply to Chris Murphy from comment #6)
> (In reply to Chris Wilson from comment #4)
> > Allocation failure on swap-in is not uncommon when dealing with 5G+ of swap,
> > the kernel struggles to cope and we make more noise than most.
> 
> Interesting. This suggests an incongruence between typical 1:1 RAM swap
> partition sizes by most distro installers, at least for use cases where
> there will be heavy pressure on RAM rather than incidental swap usage. In
> your view, is this a case of, "doctor, it hurts when I do this" and the
> doctor says, "right, so don't do that" or is there room for improvement?

It's definitely the kernel's problem in mishandling resources: there are plenty
still available, we just aren't getting the pages when, and as, they are
required. Aside from that, we are not prioritising interactive workloads very
well under these conditions. From our point of view that only increases the
memory pressure on graphics resources: work builds up faster than we can
process it, with write amplification from client to display.

> Note: these examples are unique in that the test system is using swap on
> ZRAM. So it should be significantly faster than conventional swap on a
> partition. Also, these examples have /dev/zram0 sized to 1.5X RAM, but it's
> reproducible at 1:1. In smaller swap cases, I've seen these same call traces
> far less frequently, and also oom-killer happens more frequently.
> 
> > That failure
> > does not look to be the cause of the later hang, though it may indeed be
> > related to memory pressure (although being snb it is llc so less susceptible
> > to most forms of corruption, you can still hypothesize data not making it
> > to/from swap that leads to context corruption). I would say the memory
> > layout of the batch supports the hypothesis that the context has been
> > swapped out and back in. So I am going to err on the side of assuming this
> > is an invalid context image due to swap.
> 
> The narrow goal of this torture test is to find ways of improving system
> responsiveness under heavy swap use. And also it acts much like an
> unprivileged fork bomb that can, somewhat non-deterministically I'm finding,
> take down the system (totally unresponsive for >30 minutes). And in doing
> that, I'm stumbling over other issues like this one.

Yup. Death-by-swap is an old problem (when the oomkiller doesn't kill you, you
can die of old age waiting for a response, wishing it had). Most of our effort
is spent trying to minimise the system-wide impact when running at max memory
(when the caches are regularly reaped); handling swap well has been an
afterthought for a decade.
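
For anyone wanting to compare their own setup against the 1:1 and 1.5X
swap-to-RAM ratios discussed above, here is a minimal sketch. It assumes a
Linux system exposing the usual MemTotal, SwapTotal and SwapFree fields (in
kB) in /proc/meminfo. The function names are made up for illustration and are
not part of any kernel or i915 tooling.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Values are the first integer on each line (kB for the memory fields).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_summary(fields):
    """Summarise swap sizing and usage from parsed meminfo fields."""
    mem_total = fields["MemTotal"]
    swap_total = fields.get("SwapTotal", 0)
    swap_free = fields.get("SwapFree", 0)
    swap_used = swap_total - swap_free
    return {
        # 1.0 corresponds to a 1:1 swap:RAM layout, 1.5 to the 1.5X case.
        "swap_to_ram_ratio": swap_total / mem_total,
        "swap_used_kb": swap_used,
        "swap_used_fraction": swap_used / swap_total if swap_total else 0.0,
    }


def system_swap_summary(path="/proc/meminfo"):
    """Hypothetical convenience wrapper: read the live file and summarise it."""
    with open(path) as f:
        return swap_summary(parse_meminfo(f.read()))
```

Calling system_swap_summary() on the test machine while the torture test runs
would show how much of the swap device is actually in use when the call
traces appear. It says nothing about why a particular allocation failed.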
