[Intel-gfx] [PATCH 3/3] drm/i915/selftests: Exercise adding requests to a full GGTT

Chris Wilson chris at chris-wilson.co.uk
Thu Oct 12 10:15:44 UTC 2017


Quoting Tvrtko Ursulin (2017-10-12 10:48:07)
> 
> On 11/10/2017 15:06, Chris Wilson wrote:
> > A bug recently encountered involved the issue that, as we were
> > submitting requests to different ppGTTs, each would pin a segment of
> > the GGTT for its logical context and ring. However, this is invisible
> > to eviction as we do not tie the context/ring VMA to a request and so
> > do not automatically wait upon them (instead they are marked as pinned,
> > preventing eviction entirely). Instead the eviction code must flush those
> > contexts by switching to the kernel context. This selftest tries to
> > fill the GGTT with contexts to exercise a path where the
> > switch-to-kernel-context fails to make forward progress and we fail
> > with ENOSPC.
> > 
> > v2: Make the hole in the filled GGTT explicit.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> > +     /* Overfill the GGTT with context objects and so try to evict one. */
> > +     for_each_engine(engine, i915, id) {
> > +             struct timed_fence tf;
> > +             struct drm_file *file;
> > +
> > +             file = mock_file(i915);
> > +             if (IS_ERR(file))
> > +                     return PTR_ERR(file);
> > +
> > +             count = 0;
> > +             mutex_lock(&i915->drm.struct_mutex);
> > +             timed_fence_init(&tf, round_jiffies_up(jiffies + HZ/2));
> 
> Should the timeout be i915.selftests.timeout_jiffies? Hm, not sure, since
> it is per engine as well.

It's tricky, since we need a timeout that is sufficient to bind the
remaining space and then enough time to run a few failed evict passes.

I don't feel comfortable using the igt-controlled timeout for that,
since the test is inherently timing-sensitive.

What I did not fully explore was how to get feedback from
evict_for_something that we were going to idle, so that we didn't need
the timeout at all. (Other than the necessary fence to ensure that all
requests were only queued and not executed; there I like the timed_fence
as a backup so that nothing is stuck forever. :)
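
(To make that backup concrete: the timed_fence is really just an
i915_sw_fence committed from a timer once the deadline passes. A rough
sketch of the shape, assuming the onstack_fence helpers from the selftest
lib in this series; the actual helper may differ in detail:

struct timed_fence {
	struct i915_sw_fence fence;
	struct timer_list timer;
};

static void timed_fence_wake(unsigned long data)
{
	struct timed_fence *tf = (struct timed_fence *)data;

	/* Deadline reached: release everything gated on tf->fence. */
	i915_sw_fence_commit(&tf->fence);
}

void timed_fence_init(struct timed_fence *tf, unsigned long expires)
{
	onstack_fence_init(&tf->fence);

	setup_timer_on_stack(&tf->timer, timed_fence_wake, (unsigned long)tf);
	if (time_after(expires, jiffies))
		mod_timer(&tf->timer, expires);
	else
		i915_sw_fence_commit(&tf->fence);
}

void timed_fence_fini(struct timed_fence *tf)
{
	/* If the timer never fired, commit now so no waiter is left stuck. */
	if (del_timer_sync(&tf->timer))
		i915_sw_fence_commit(&tf->fence);

	destroy_timer_on_stack(&tf->timer);
	i915_sw_fence_fini(&tf->fence);
}

So whatever eviction does, every request queued behind tf.fence is
released once the deadline passes and the loop above can unwind.)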

> 
> > +             do {
> > +                     struct drm_i915_gem_request *rq;
> > +                     struct i915_gem_context *ctx;
> > +
> > +                     ctx = live_context(i915, file);
> > +                     if (IS_ERR(ctx))
> > +                             break;
> > +
> > +                     rq = i915_gem_request_alloc(engine, ctx);
> > +                     if (IS_ERR(rq)) {
> > +                             pr_err("Unexpected error from request alloc (ctx hw id %u, on %s): %d\n",
> > +                                    ctx->hw_id, engine->name,
> > +                                    (int)PTR_ERR(rq));
> > +                             err = PTR_ERR(rq);
> > +                             break;
> > +                     }
> > +
> > +                     i915_sw_fence_await_sw_fence_gfp(&rq->submit, &tf.fence,
> > +                                                      GFP_KERNEL);
> > +
> > +                     i915_add_request(rq);
> > +                     count++;
> > +             } while (!i915_sw_fence_done(&tf.fence));
> > +             mutex_unlock(&i915->drm.struct_mutex);
> > +
> > +             timed_fence_fini(&tf);
> > +             pr_info("Submitted %lu contexts/requests on %s\n",
> > +                     count, engine->name);
> 
> I worry about no upper bound on the number of requests this will queue 
> in HZ/2. What are the typical numbers it reports here? Would it be 
> better to try some minimum number of requests (like a multiplier of the 
> pretend GGTT size, if we can query or guess the typical LRC/ringbuffer 
> sizes per context) and then signal the fence manually?

The upper bound is how many we can fit in the remaining ~16MiB of space.
The amount required per request is a bit nebulous (context size varies
per engine per gen, ring sizes differ, fragmentation, etc.). That is,
apart from legacy !full-ppgtt, where we would just execute until the
timeout with lots of fake contexts.
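
(To put rough, purely illustrative numbers on it: guessing something on
the order of ~80KiB of GGTT per logical context image plus ring, the
remaining ~16MiB only holds a couple of hundred contexts per engine
before eviction has to kick in.)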

I agree that using a timeout is very unsatisfactory. Let's look at
injecting code into evict_something instead and see if that works any
better.
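
(Hypothetical sketch only, names invented, to show the sort of thing I
mean rather than anything in this patch: a selftest-only knob in the
eviction path that igt can flip, turning "we would have to idle the GPU"
into an immediate -EBUSY report:

/* Compiled away unless CONFIG_DRM_I915_SELFTEST is enabled. */
I915_SELFTEST_DECLARE(static struct igt_evict_ctl {
	bool fail_if_busy:1;
} igt_evict_ctl;)

/* Consulted by evict_something() once scanning has found nothing to
 * unbind and the only remaining option is switch-to-kernel-context
 * followed by wait-for-idle.
 */
static bool evict_defer_to_igt(void)
{
	return I915_SELFTEST_ONLY(igt_evict_ctl.fail_if_busy);
}

The selftest would set and clear the flag around the allocation it
expects to trigger eviction, giving deterministic feedback instead of
depending on a wall-clock timeout.)
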
-Chris

