[Intel-gfx] [PATCH v2 03/37] drm/i915/region: support basic eviction

Daniel Vetter daniel at ffwll.ch
Thu Aug 15 14:34:02 UTC 2019


On Thu, Aug 15, 2019 at 4:26 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> On Thu, Aug 15, 2019 at 12:48 PM Matthew Auld
> <matthew.william.auld at gmail.com> wrote:
> >
> > On Tue, 30 Jul 2019 at 17:26, Daniel Vetter <daniel at ffwll.ch> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 09:55:59PM +0100, Matthew Auld wrote:
> > > > Support basic eviction for regions.
> > > >
> > > > Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> > > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > > > Cc: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
> > >
> > > So from a very high level this looks like it was largely modelled after
> > > i915_gem_shrinker.c and not i915_gem_evict.c (our other "make room, we're
> > > running out of stuff" code). Any specific reasons?
> >
> > IIRC it was originally based on the patches that exposed
> > stolen-memory to userspace from a few years ago.

Forgot to add this:

Yeah, I guess those have been best-effort at most. Or at least I
never really looked at them seriously since the open source userspace
never went anywhere.
-Daniel

> > > I think i915_gem_evict is a lot closer match for what we want for vram (it
> > > started out to manage the severely limited GTT on gen2/3/4) after all. The
> > > complication is that we'll have to manage physical memory with multiple
> > > virtual mappings of it on top, so unfortunately we can't just reuse the
> > > locking pattern Chris has come up with in his struct_mutex-removal branch.
> > > But at least conceptually it should be a lot closer.
> >
> > When you say make it more like i915_gem_evict, what does that mean?
> > Are you talking about the eviction roster stuff, or the
> > placement/locking of the eviction logic, with it being deep down in
> > get_pages?
>
> So there's kinda two aspects here that I meant.
>
> First is the high-level approach of the shrinker, which is a direct
> reflection of core mm low memory handling principles: Core mm just
> tries to equally shrink everyone when there's low memory, which is
> managed by watermarks, and a few other tricks. This is all only
> best-effort, and if multiple threads want a lot of memory at the same
> time then it's all going to fail with ENOMEM.
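>
> Roughly this pattern (just a sketch to illustrate, the lmem_*
> helpers here are made up, this is not actual core mm or i915 code):
>
>     /* Best-effort: try once, shrink a bit, try again, give up. */
>     static int lmem_alloc_best_effort(struct lmem_region *mem,
>                                       u64 size,
>                                       struct lmem_block **out)
>     {
>             int err;
>
>             err = lmem_alloc(mem, size, out);
>             if (err != -ENOSPC)
>                     return err;
>
>             /* Racy: other threads can swallow whatever the
>              * shrink pass frees before we get back to it. */
>             lmem_shrink(mem, size);
>
>             return lmem_alloc(mem, size, out); /* may still ENOSPC */
>     }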
>
> On gpus otoh, and what we do in i915_gem_evict.c for gtt (and very
> much needed with the tiny gtt for everything in gen2/3/4/5) is that
> when we run out of space, we stall, throw out everyone else, and have
> exclusive access to the entire gpu space. Then the next batchbuffer
> goes through the same dance. With this you guarantee that if you have
> a series of batchbuffers which all need e.g. 60% of lmem, they will
> all be able to execute. With the shrinker-style of low-memory handling
> eventually you get unlucky: both threads will only get up to 50%, fail
> with ENOSPC, and userspace crashes. Which is not good.
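>
> Whereas the eviction style is the same thing minus the giving up
> (same made-up helpers as above): block and throw out other people's
> buffers until the allocation succeeds, so every batch eventually
> gets its turn:
>
>     static int lmem_alloc_blocking(struct lmem_region *mem,
>                                    u64 size,
>                                    struct lmem_block **out)
>     {
>             int err;
>
>             for (;;) {
>                     err = lmem_alloc(mem, size, out);
>                     if (err != -ENOSPC)
>                             return err;
>
>                     /* Stall until we've evicted someone else;
>                      * guarantees forward progress as long as
>                      * the allocation fits lmem by itself. */
>                     err = lmem_evict_something(mem, size);
>                     if (err)
>                             return err;
>             }
>     }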
>
> The other bit is locking. Since we need to free pages from the
> shrinker there's tricky locking rules involved. Worse, we cannot back
> off from the shrinker down to e.g. the kmalloc or alloc_pages call
> that put us into reclaim. Which means the usual deadlock avoidance
> trick of having a slowpath where you first drop all the locks, then
> reacquire them in the right order doesn't work - in some cases the
> caller of kmalloc or alloc_pages could be holding a lock that we'd
> need to unlock first. Hence why the shrinker uses the
> best-effort-might-fail solution of trylocks, encoded in shrinker_lock.
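>
> I.e. conceptually something like this (heavily simplified from what
> i915_gem_shrinker.c really does):
>
>     static bool shrinker_lock(struct drm_i915_private *i915)
>     {
>             /* We might be in reclaim with struct_mutex already
>              * held further up the call chain, so all we can do
>              * is trylock, and skip the shrinking if we lose. */
>             return mutex_trylock(&i915->drm.struct_mutex);
>     }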
>
> But for lmem we don't have such an excuse, because it's all our own
> code. The locking design can (and should!) assume that it can get out
> of any deadlock and always acquire all the locks it needs. Without
> that you can't achieve the first part about guaranteeing execution of
> batches which collectively need more than 100% of lmem, but
> individually all fit. As an example, the amdgpu command submission
> ioctl passes around a ttm_operation_ctx which tracks a
> few things about locks and other bits, and if they hit a possible
> deadlock situation they can unwind the entire CS and restart by taking
> the locks in the right order.
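>
> In kernel terms that unwind-and-retry dance is what the ww_mutex
> machinery gives you (and what TTM is built on). Rough sketch for
> just two locks, not lifted from amdgpu:
>
>     #include <linux/ww_mutex.h>
>
>     static DEFINE_WW_CLASS(lmem_ww_class);
>
>     /* Caller brackets this with ww_acquire_init(ctx, &lmem_ww_class)
>      * and ww_acquire_fini(), and unlocks both mutexes when done. */
>     static int lock_both(struct ww_mutex *a, struct ww_mutex *b,
>                          struct ww_acquire_ctx *ctx)
>     {
>             int err;
>
>             err = ww_mutex_lock(a, ctx);
>             if (err)
>                     return err;
>
>             while ((err = ww_mutex_lock(b, ctx)) == -EDEADLK) {
>                     /* We lost the ordering race: drop what we
>                      * hold, sleep on the contended lock, then
>                      * retake the other one in the new order. */
>                     ww_mutex_unlock(a);
>                     ww_mutex_lock_slow(b, ctx);
>                     swap(a, b);
>             }
>             if (err)
>                     ww_mutex_unlock(a);
>
>             return err;
>     }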
>
> I thought I typed that up somewhere, but I guess it got lost ...
>
> Cheers, Daniel
>
> >
> > >
> > > But I might be entirely off the track with reconstructing how this code
> > > came to be, so please elaborate a bit.
> > >
> > > Thanks, Daniel
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

