[Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

Wed Nov 27 07:48:51 CET 2013

On Tue, Nov 26, 2013 at 08:23:46PM -0800, Ben Widawsky wrote:
> On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
> > If we end up calling the shrinker, which in turn requires the OOM
> > killer, we may end up waiting forever for the process the OOM killer
> > chooses to die. The case that this prevents occurs in execbuf. The
> > forked variants of gem_evict_everything are a good way to hit it. This is
> > exacerbated by Daniel's recent patch to give OOM precedence to the GEM
> > tests.
> > 
> > It's a twisted form of a deadlock.
> > 
> > What occurs is the following (assume just 2 procs)
> > 1. proc A gets to execbuf while out of memory, gets struct_mutex.
> > 2. OOM killer comes in and chooses proc B
> > 3. proc B closes its fds, which requires struct_mutex, blocks
> > 4. OOM killer waits for B to die before killing another process (this
> > part is speculative)
> > 
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> 
> I'd still like to know if I am crazy, but I'm now trying to defer the
> stuff we do on file close without using any allocs. Just an update...
> 

The workqueue approach still has similar problems. It could be because
deferring the context cleanup means we don't actually free much space, and
so the OOM kill isn't enough, or [more likely] something else is going on.

Maybe it's my bug. I am really out of ideas at the moment. The system
just becomes unresponsive after all threads end up blocked waiting for
struct_mutex. I know I'd seen such problems in the past with
gem_evict_everything, but this time around I seem to be the sole cause
(and not all my machines hit it).
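
For anyone trying to follow the four-step sequence quoted above, here is a
rough user-space model of it. This is only a sketch under stated
assumptions, not driver code: threading.Lock stands in for struct_mutex, a
thread stands in for proc B, and simulate(), proc_b() and the timeouts are
hypothetical names and values made up for illustration. The timeouts exist
only so the model terminates instead of hanging the way the real system
does, and step 4 is still the speculative part.

```python
import threading


def simulate(a_holds_lock, timeout=0.2):
    """Model the suspected deadlock; return True if "proc B" managed to exit."""
    struct_mutex = threading.Lock()   # stand-in for the driver's struct_mutex
    b_exited = threading.Event()      # set once proc B has finished closing fds

    def proc_b():
        # Step 3: closing fds needs struct_mutex. In the real system this
        # blocks indefinitely; here it gives up after `timeout` seconds.
        if struct_mutex.acquire(timeout=timeout):
            struct_mutex.release()
            b_exited.set()

    # Step 1: proc A enters execbuf and takes struct_mutex.
    struct_mutex.acquire()

    # Step 2: A's allocation triggers the OOM killer, which picks proc B.
    victim = threading.Thread(target=proc_b)
    if not a_holds_lock:
        # Hypothetical variant: A drops the lock before the OOM wait.
        struct_mutex.release()
    victim.start()

    # Step 4 (speculative): the OOM killer waits for B to die before it
    # does anything else. A is stuck behind that wait.
    died = b_exited.wait(timeout=2 * timeout)

    victim.join()
    if a_holds_lock:
        struct_mutex.release()
    return died
```

With these assumptions, simulate(True) returns False (B never gets the lock
while A sits in the OOM wait), and simulate(False) returns True (B can exit
once A is not holding the lock across the wait).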

Sorry for the noise - just totally burning out on this one.

-- 
Ben Widawsky, Intel Open Source Technology Center