[Intel-gfx] [PATCH] drm/i915: Limit the backpressure for i915_request allocation
Chris Wilson
chris at chris-wilson.co.uk
Wed Sep 12 14:54:02 UTC 2018
Quoting Daniel Vetter (2018-09-12 15:47:21)
> On Wed, Sep 12, 2018 at 3:42 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > Quoting Tvrtko Ursulin (2018-09-12 14:34:16)
> >>
> >> On 12/09/2018 12:11, Chris Wilson wrote:
> >> > If we try and fail to allocate an i915_request, we apply some
> >> > backpressure on the clients to throttle the memory allocations coming
> >> > from i915.ko. Currently, we wait until completely idle, but this is far
> >> > too heavy and leads to some situations where the only escape is to
> >> > declare a client hung and reset the GPU. The intent is only to
> >> > rate-limit the allocation requests, so we need only wait for a jiffie
> >> > before using the normal direct reclaim.
> >> >
> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> >> > ---
> >> > drivers/gpu/drm/i915/i915_request.c | 2 +-
> >> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> >> > index 09ed48833b54..588bc5a4d18b 100644
> >> > --- a/drivers/gpu/drm/i915/i915_request.c
> >> > +++ b/drivers/gpu/drm/i915/i915_request.c
> >> > @@ -736,7 +736,7 @@ i915_request_alloc(struct intel_engine_cs *engine, struct i915_gem_context *ctx)
> >> >          ret = i915_gem_wait_for_idle(i915,
> >> >                                       I915_WAIT_LOCKED |
> >> >                                       I915_WAIT_INTERRUPTIBLE,
> >> > -                                     MAX_SCHEDULE_TIMEOUT);
> >> > +                                     1);
> >> >          if (ret)
> >> >                  goto err_unreserve;
> >> >
> >> >
> >>
> >> What is the remaining value of even trying to wait for idle, instead
> >> of, say, just calling i915_request_retire and sleeping for a jiffie?
> >> The intention would read more clearly, since it is questionable
> >> whether there is any relationship between idling and rate-limiting
> >> clients. In fact, now that I think of it, waiting for idle is a nice
> >> way to starve an unlucky client forever.
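For illustration, Tvrtko's alternative might look something like the
sketch below. Treat it as a sketch only: i915_retire_requests() is the
helper of this era for reaping completed requests (i915_request_retire()
wants a specific request in hand), and the signal handling around the
one-jiffie sleep is guesswork on my part:

	/* Reap whatever has already completed, then back off for a
	 * single jiffie rather than waiting for the whole GPU to idle.
	 */
	i915_retire_requests(i915);
	if (schedule_timeout_interruptible(1))
		ret = -ERESTARTSYS; /* woken early by a signal */
	else
		ret = 0; /* slept for the full jiffie */
	if (ret)
		goto err_unreserve;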
> >
> > Better to starve the unlucky client than to allow the entire system
> > to grind to a halt.
> >
> > One caveat of using RCU is that it is our responsibility to apply
> > backpressure, as none is applied by the vm.
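For context (quoting from memory, so treat the field name and exact
flags as an assumption): the request cache is created with
SLAB_TYPESAFE_BY_RCU, which is why the vm exerts no pressure of its
own; freed requests are only handed back to the system after a grace
period, so direct reclaim sees little it can immediately recover.

	/* Request objects are type-stable: frees are deferred past an
	 * RCU grace period before the pages can be reclaimed.
	 */
	i915->requests = KMEM_CACHE(i915_request,
				    SLAB_HWCACHE_ALIGN |
				    SLAB_RECLAIM_ACCOUNT |
				    SLAB_TYPESAFE_BY_RCU);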
>
> So instead of one jiffie, should we wait for one RCU grace period? My
> understanding is that under very heavy load these can be extended
> (since batching is more effective for throughput, so long as you don't
> run out of memory). Just a random comment from the sidelines really :-)
There's (essentially) a wait for that later ;)
But a long time ago Paul did write a missive explaining that there
should be some use of cond_synchronize_rcu() to provide the
rate-limiting, though I've never been sure how to plug in the right
figures. Do we say that if the timeline does more than two RCU
allocations within the same grace period it should be throttled? A
light allocation failure has always seemed a sensible trigger to start
worrying.
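Something like the below is what I have in mind, with the per-timeline
cookie standing in for the figures I've never known how to pick (the
tl->rcu_cookie field and the one-allocation-per-grace-period policy are
both made up for the sketch, and are stricter than the two-per-period
musing above):

	/* Throttle: require a full RCU grace period to have elapsed
	 * between this allocation and the previous one on the same
	 * timeline.
	 */
	cond_synchronize_rcu(tl->rcu_cookie); /* no-op if a GP already passed */
	rq = kmem_cache_alloc(i915->requests, GFP_KERNEL);
	tl->rcu_cookie = get_state_synchronize_rcu();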
-Chris