[Intel-gfx] [PATCH 27/33] drm/i915: Replace global breadcrumbs with per-context interrupt tracking
Chris Wilson
chris at chris-wilson.co.uk
Fri Jan 25 14:26:49 UTC 2019
Quoting Tvrtko Ursulin (2019-01-25 13:54:05)
>
> On 25/01/2019 02:29, Chris Wilson wrote:
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 2171df2d3019..c09a6644a2ab 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -60,7 +60,7 @@ static bool i915_fence_signaled(struct dma_fence *fence)
> >
> > static bool i915_fence_enable_signaling(struct dma_fence *fence)
> > {
> > - return intel_engine_enable_signaling(to_request(fence), true);
> > + return i915_request_enable_breadcrumb(to_request(fence));
>
> enable_signaling would be better IMO; enable_breadcrumb sounds like it
> could be to do with breadcrumb emission. And it would align with
> i915_fence_enable_signaling.
But isn't it entirely to do with our breadcrumbs...
I'm going the other way: since the request/interrupt logic is called
breadcrumbs, we want to link the dma-fence signaling back to those
breadcrumbs.
I liked the new name :)
> > /**
> > * i915_request_started - check if the request has begun being executed
> > * @rq: the request
> > @@ -345,7 +369,23 @@ static inline bool i915_request_started(const struct i915_request *rq)
> > return true;
> >
> > /* Remember: started but may have since been preempted! */
> > - return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno - 1);
> > + return __i915_request_has_started(rq);
> > +}
>
> Is it worth having both i915_request_started and
> __i915_request_has_started? AFAIU the fence cannot be signaled unless
> the seqno has advanced, so if __i915_request_has_started were dropped
> and callers just used the former, I don't think there would be any
> issues. Callers always check for completion beforehand. I guess it
> saves a redundant fence-signaled check.
I was trying to consolidate down on the fence->flags logic, and decided
half the checks were redundant.
I was also dipping my toe into the water to see how everyone reacted to
a possible
s/i915_request_completed/i915_request_has_completed/
s/i915_request_started/i915_request_has_started/
i915_request_is_running
> > +static inline bool i915_request_is_running(const struct i915_request *rq)
> > +{
> > + if (!i915_request_is_active(rq))
> > + return false;
> > +
> > + return __i915_request_has_started(rq);
>
> return i915_request_is_active(rq) && __i915_request_has_started(rq);
> probably doesn't fit in 80 chars to be that attractive?
It keeps the pattern with the other status checks (started, completed,
is_running) though!
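To spell out the pattern I mean (a rough sketch only, using the has_*
names floated above rather than what is literally in the patch):

static inline bool i915_request_has_started(const struct i915_request *rq)
{
	/* Once signaled, it must have passed through being started */
	if (i915_request_signaled(rq))
		return true;

	/* Remember: started but may have since been preempted! */
	return __i915_request_has_started(rq);
}

static inline bool i915_request_is_running(const struct i915_request *rq)
{
	/* Only meaningful while the request is queued on HW */
	if (!i915_request_is_active(rq))
		return false;

	return __i915_request_has_started(rq);
}

static inline bool i915_request_has_completed(const struct i915_request *rq)
{
	if (i915_request_signaled(rq))
		return true;

	return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno);
}

Each helper is a quick guard on the dma-fence/execution state followed by
the shared seqno comparison, which is the pattern I'd like to keep.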
> > -static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
> > +static void irq_enable(struct intel_engine_cs *engine)
> > +{
> > + if (!engine->irq_enable)
> > + return;
> > +
> > + /* Caller disables interrupts */
>
> GEM_BUG_ON(!irqs_disabled()); ?
Nah, I thought I had removed it before, but apparently there was more
than one check here.
It doesn't matter if we set the IMR while interrupts are disabled; at
worst we get a hang report. We can't turn off interrupts without
parking, and we can't park without interrupts, so that's probably the
best point to say "oi!"
> > + /*
> > + * We may race with direct invocation of
> > + * dma_fence_signal(), e.g. i915_request_retire(),
> > + * in which case we can skip processing it ourselves.
> > + */
> > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > + &rq->fence.flags))
> > + continue;
>
> Doesn't look like this is worth it. We could then race a line below and
> still end up with an already-signaled rq on the local list. So I would
> be tempted to drop this for simplicity.
It's virtually free since we already have the exclusive cacheline for
rq->fence.flags. ~o~
(And when it does race, we win by skipping the atomic and the list
manipulation.)
There's some benefit in having little reminders that many paths lead to
dma_fence_signal.
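(To spell out why losing that race is harmless: the early test_bit() is
just a cheap filter; the authoritative step is the atomic inside
dma_fence_signal() itself, which quietly does nothing if the fence is
already signaled. The pattern, not the exact code:)

	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &rq->fence.flags))
		continue;	/* retire (or another signaler) got there first */

	/* ... move rq onto the local list, drop the spinlock ... */

	dma_fence_signal(&rq->fence);	/* idempotent, so losing the race is fine */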
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index 3f9f0a5a0f44..d7c31ad1d0ea 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -486,8 +486,8 @@ static u32 *gen7_xcs_emit_breadcrumb(struct i915_request *rq, u32 *cs)
> >
> > for (i = 0; i < GEN7_XCS_WA; i++) {
> > *cs++ = MI_STORE_DWORD_INDEX;
> > - *cs++ = I915_GEM_HWS_INDEX_ADDR;
> > - *cs++ = rq->global_seqno;
> > + *cs++ = I915_GEM_HWS_SEQNO_ADDR;
> > + *cs++ = rq->fence.seqno;
>
> Why this, and in this patch?
A slip of the fingers. Spotted that last night as well and forgot to
move the chunk.
> > + for (n = 0; n < count; n++) {
> > + struct i915_gem_context *ctx =
> > + t->contexts[order[n] % t->ncontexts];
> > + struct i915_request *rq;
> > +
> > + mutex_lock(BKL);
> > +
> > + rq = t->request_alloc(ctx, t->engine);
> > + if (IS_ERR(rq)) {
> > + mutex_unlock(BKL);
> > + err = PTR_ERR(rq);
> > + count = n;
> > + break;
> > + }
> > +
> > + err = i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> > + submit,
> > + GFP_KERNEL);
>
> The submit fence is just there to ensure the request doesn't finish
> before you can enable signaling on it via the wait fence?
Yes. We put everything into a background queue, complete our setup, then
let the scheduler do its thing.
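Condensed from the hunk above, the shape of each loop iteration and its
release is just:

	/* gate execution of every request on a single submit fence */
	i915_sw_fence_await_sw_fence_gfp(&rq->submit, submit, GFP_KERNEL);

	/* and have the wait fence listen for each request's completion */
	i915_sw_fence_await_dma_fence(wait, &rq->fence, 0, GFP_KERNEL);

	/* once the whole batch is queued: */
	i915_sw_fence_commit(submit);	/* release every request at once */
	i915_sw_fence_commit(wait);	/* then block until they all signal */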
> > +
> > + requests[n] = i915_request_get(rq);
> > + i915_request_add(rq);
> > +
> > + mutex_unlock(BKL);
> > +
> > + if (err >= 0)
> > + err = i915_sw_fence_await_dma_fence(wait,
> > + &rq->fence,
> > + 0,
> > + GFP_KERNEL);
>
> If the above is true, why can't this simply be i915_request_enable_breadcrumbs?
I was trying to test breadcrumbs at a high level without 'cheating'.
For i915_request_enable_breadcrumbs, I keep thinking the test should be
asking if the engine then has it on its signaling list, but that feels
like a circular argument, directly following the implementation itself.
> > + if (err < 0) {
>
> else if?
Now where would I be able to copy'n'paste that from? ;)
> > + i915_request_put(rq);
> > + count = n;
> > + break;
> > + }
> > + }
> > +
> > + i915_sw_fence_commit(submit);
> > + i915_sw_fence_commit(wait);
> > +
> > + if (!wait_event_timeout(wait->wait,
> > + i915_sw_fence_done(wait),
> > + HZ / 2)) {
> > + struct i915_request *rq = requests[count - 1];
> > +
> > + pr_err("waiting for %d fences (last %llx:%lld) on %s timed out!\n",
> > + count,
> > + rq->fence.context, rq->fence.seqno,
> > + t->engine->name);
> > + i915_gem_set_wedged(t->engine->i915);
> > + GEM_BUG_ON(!i915_request_completed(rq));
> > + i915_sw_fence_wait(wait);
> > + err = -EIO;
> > + }
> > +
> > + for (n = 0; n < count; n++) {
> > + struct i915_request *rq = requests[n];
> > +
> > + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > + &rq->fence.flags)) {
> > + pr_err("%llu:%llu was not signaled!\n",
> > + rq->fence.context, rq->fence.seqno);
> > + err = -EINVAL;
> > + }
> > +
> > + i915_request_put(rq);
> > + }
> > +
> > + heap_fence_put(wait);
> > + heap_fence_put(submit);
> > +
> > + if (err < 0)
> > + break;
> > +
> > + num_fences += count;
>
> Isn't num_fences actually also growing by one, while num_requests would
> be growing by count?
Each request is a fence (rq->fence); it's the requests/fences that we're
inspecting here, while the i915_sw_fence itself we take on trust.
> > + num_waits++;
> > +
> > + cond_resched();
>
> There is a wait_event_timeout in the loop so this shouldn't be needed.
> AFAICS if something fails the loop bails out.
Force of habit. Rather be overkill than have somebody moan about
RCU/khungtaskd stalls.
> > + for (n = 0; n < ncpus; n++) {
> > + threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
> > + &t, "igt/%d", n);
> > + if (IS_ERR(threads[n])) {
> > + ret = PTR_ERR(threads[n]);
> > + ncpus = n;
>
> I would be tempted to just fail the test rather than carry on,
> especially since it seems the selftest will report failure anyway as
> ret is preserved.
We have to tidy up; that's why we carry on, keeping the error state.
> > + break;
> > + }
> > +
> > + get_task_struct(threads[n]);
> > + }
>
> Thread creation looks like it can be done outside struct_mutex.
Could be.
> > +static int
> > +max_batches(struct i915_gem_context *ctx, struct intel_engine_cs *engine)
> > +{
> > + struct i915_request *rq;
> > + int ret;
> > +
> > + /*
> > + * Before execlists, all contexts share the same ringbuffer. With
> > + * execlists, each context/engine has a separate ringbuffer and
> > + * for the purposes of this test, inexhaustible.
> > + *
> > + * For the global ringbuffer though, we have to be very careful
> > + * that we do not wrap while preventing the execution of requests
> > + * with an unsignaled fence.
> > + */
> > + if (HAS_EXECLISTS(ctx->i915))
> > + return INT_MAX;
> > +
> > + rq = i915_request_alloc(engine, ctx);
> > + if (IS_ERR(rq)) {
> > + ret = PTR_ERR(rq);
> > + } else {
> > + int sz;
> > +
> > + ret = rq->ring->size - rq->reserved_space;
> > + i915_request_add(rq);
> > +
> > + sz = rq->ring->emit - rq->head;
> > + if (sz < 0)
> > + sz += rq->ring->size;
> > + ret /= sz;
> > + ret /= 2; /* leave half spare, in case of emergency! */
> > +
> > + /* One ring interleaved between requests from all cpus */
> > + ret /= num_online_cpus() + 1;
>
> It would probably be more robust for any potential future users to
> leave the num_cpus division to the caller.
But then I would only have INT_MAX/ncpus for execlists!!! :)
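For a feel of the numbers (values purely illustrative, not measured):
with a 16KiB legacy ring, ~200 bytes of reserved space, ~256 bytes
emitted per request and 4 cpus, the arithmetic above gives

	max_batch = ((16384 - 200) / 256) / 2 / (4 + 1)
	          = 63 / 2 / 5	(integer division)
	          = 6

i.e. each thread limits itself to roughly half a dozen requests in
flight, leaving plenty of slack in the shared ring.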
> > +static int live_breadcrumbs_smoketest(void *arg)
> > +{
> > + struct drm_i915_private *i915 = arg;
> > + struct smoketest t[I915_NUM_ENGINES];
> > + unsigned int ncpus = num_online_cpus();
> > + unsigned long num_waits, num_fences;
> > + struct intel_engine_cs *engine;
> > + struct task_struct **threads;
> > + struct igt_live_test live;
> > + enum intel_engine_id id;
> > + intel_wakeref_t wakeref;
> > + struct drm_file *file;
> > + unsigned int n;
> > + int ret = 0;
> > +
> > + /*
> > + * Smoketest our breadcrumb/signal handling for requests across multiple
> > + * threads. A very simple test to only catch the most egregious of bugs.
> > + * See __igt_breadcrumbs_smoketest();
> > + *
> > + * On real hardware this time.
> > + */
> > +
> > + wakeref = intel_runtime_pm_get(i915);
> > +
> > + file = mock_file(i915);
> > + if (IS_ERR(file)) {
> > + ret = PTR_ERR(file);
> > + goto out_rpm;
> > + }
> > +
> > + threads = kcalloc(ncpus * I915_NUM_ENGINES,
> > + sizeof(*threads),
> > + GFP_KERNEL);
> > + if (!threads)
> > + return -ENOMEM;
>
> goto out_mock_file; to avoid two leaks.
>
> > +
> > + memset(&t[0], 0, sizeof(t[0]));
> > + t[0].request_alloc = __live_request_alloc;
> > + t[0].ncontexts = 64;
> > + t[0].contexts = kmalloc_array(t[0].ncontexts,
> > + sizeof(*t[0].contexts),
> > + GFP_KERNEL);
> > + if (!t[0].contexts) {
> > + ret = -ENOMEM;
> > + goto out_threads;
> > + }
> > +
> > + mutex_lock(&i915->drm.struct_mutex);
> > + for (n = 0; n < t[0].ncontexts; n++) {
> > + t[0].contexts[n] = live_context(i915, file);
> > + if (!t[0].contexts[n]) {
> > + ret = -ENOMEM;
> > + goto out_contexts;
>
> Context list unwind is missing.
The magic of live_context() is that the contexts are added to the file
and released when the file is freed, if not earlier.
> > + ret = igt_live_test_begin(&live, i915, __func__, "");
> > + if (ret)
> > + goto out_contexts;
> > +
> > + for_each_engine(engine, i915, id) {
> > + t[id] = t[0];
> > + t[id].engine = engine;
> > + t[id].max_batch = max_batches(t[0].contexts[0], engine);
> > + if (t[id].max_batch < 0) {
> > + ret = t[id].max_batch;
> > + goto out_flush;
> > + }
> > + pr_debug("Limiting batches to %d requests on %s\n",
> > + t[id].max_batch, engine->name);
> > +
> > + for (n = 0; n < ncpus; n++) {
> > + struct task_struct *tsk;
> > +
> > + tsk = kthread_run(__igt_breadcrumbs_smoketest,
> > + &t[id], "igt/%d.%d", id, n);
> > + if (IS_ERR(tsk)) {
> > + ret = PTR_ERR(tsk);
> > + goto out_flush;
> > + }
> > +
> > + get_task_struct(tsk);
> > + threads[id * ncpus + n] = tsk;
> > + }
> > + }
> > + mutex_unlock(&i915->drm.struct_mutex);
>
> I think thread setup could be done outside struct_mutex in two passes,
> if only the max_batches calculation is done first under it.
Then do it as one pass, since having a thread start a few microseconds
earlier is immaterial in the grand scheme of things.
> > + msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
> > +
> > +out_flush:
> > + num_waits = 0;
> > + num_fences = 0;
> > + for_each_engine(engine, i915, id) {
> > + for (n = 0; n < ncpus; n++) {
> > + struct task_struct *tsk = threads[id * ncpus + n];
> > + int err;
> > +
> > + if (!tsk)
> > + continue;
> > +
> > + err = kthread_stop(tsk);
> > + if (err < 0 && !ret)
> > + ret = err;
> > +
> > + put_task_struct(tsk);
> > + }
> > +
> > + num_waits += atomic_long_read(&t[id].num_waits);
> > + num_fences += atomic_long_read(&t[id].num_fences);
> > + }
> > + pr_info("Completed %lu waits for %lu fence across %d engines and %d cpus\n",
>
> fences plural
But it might only be 1! ;)
-Chris