[Intel-gfx] [PATCH 27/33] drm/i915: Replace global breadcrumbs with per-context interrupt tracking
Chris Wilson
chris at chris-wilson.co.uk
Fri Jan 25 14:26:49 UTC 2019
Quoting Tvrtko Ursulin (2019-01-25 13:54:05)
>
> On 25/01/2019 02:29, Chris Wilson wrote:
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 2171df2d3019..c09a6644a2ab 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -60,7 +60,7 @@ static bool i915_fence_signaled(struct dma_fence *fence)
> >
> > static bool i915_fence_enable_signaling(struct dma_fence *fence)
> > {
> > - return intel_engine_enable_signaling(to_request(fence), true);
> > + return i915_request_enable_breadcrumb(to_request(fence));
>
> enable_signaling would be better IMO; enable_breadcrumb sounds like it
> could be to do with breadcrumb emission. And it would align with
> i915_fence_enable_signaling.
But isn't it entirely to do with our breadcrumbs...
I'm going the other way: since the request/interrupt logic is called
breadcrumbs, we want to link the dma-fence signaling back to those
breadcrumbs.
I liked the new name :)
> > /**
> > * i915_request_started - check if the request has begun being executed
> > * @rq: the request
> > @@ -345,7 +369,23 @@ static inline bool i915_request_started(const struct i915_request *rq)
> > return true;
> >
> > /* Remember: started but may have since been preempted! */
> > - return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno - 1);
> > + return __i915_request_has_started(rq);
> > +}
>
> Is it worth having both i915_request_started and
> __i915_request_has_started? AFAIU the fence cannot be signaled unless
> the seqno has advanced, so if __i915_request_has_started were dropped
> and callers just used the former, I don't think there would be any
> issues. Callers always check for completion beforehand. I guess it
> saves a redundant fence-signaled check.
I was trying to consolidate down on the fence->flags logic, and decided
half the checks were redundant.
I was also dipping my toe into the water to see how everyone reacted to
a possible
s/i915_request_completed/i915_request_has_completed/
s/i915_request_started/i915_request_has_started/
i915_request_is_running
> > +static inline bool i915_request_is_running(const struct i915_request *rq)
> > +{
> > + if (!i915_request_is_active(rq))
> > + return false;
> > +
> > + return __i915_request_has_started(rq);
>
> return i915_request_is_active(rq) && __i915_request_has_started(rq);
> probably doesn't fit in 80 chars to be that attractive?
It keeps the pattern with the other status checks (started, completed,
is_running) though!
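To spell out the pattern I mean (a rough sketch only, using the has_*
names floated above rather than what is literally in the patch):

static inline bool i915_request_has_started(const struct i915_request *rq)
{
	/* Once signaled, it must have passed through being started */
	if (i915_request_signaled(rq))
		return true;

	/* Remember: started but may have since been preempted! */
	return __i915_request_has_started(rq);
}

static inline bool i915_request_is_running(const struct i915_request *rq)
{
	/* Only meaningful while the request is queued on HW */
	if (!i915_request_is_active(rq))
		return false;

	return __i915_request_has_started(rq);
}

static inline bool i915_request_has_completed(const struct i915_request *rq)
{
	if (i915_request_signaled(rq))
		return true;

	return i915_seqno_passed(hwsp_seqno(rq), rq->fence.seqno);
}

Each helper is a quick guard on the dma-fence/execution state followed by
the shared seqno comparison, which is the pattern I'd like to keep.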
> > -static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
> > +static void irq_enable(struct intel_engine_cs *engine)
> > +{
> > + if (!engine->irq_enable)
> > + return;
> > +
> > + /* Caller disables interrupts */
>
> GEM_BUG_ON(!irqs_disabled()); ?
Nah, I thought I had removed it before, but apparently there was more
than one check here.
It doesn't matter if we set the IMR while interrupts are disabled; at
worst we get a hang report. We can't turn off interrupts without
parking, and we can't park without interrupts, so that's probably the
best point to say "oi!"
> > + /*
> > + * We may race with direct invocation of
> > + * dma_fence_signal(), e.g. i915_request_retire(),
> > + * in which case we can skip processing it ourselves.
> > + */
> > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > + &rq->fence.flags))
> > + continue;
>
> Doesn't look like this is worth it. We could then race a line below and
> still end up with an already-signaled rq on the local list. So I would
> be tempted to drop this for simplicity.
It's virtually free since we already have the exclusive cacheline for
rq->fence.flags. ~o~
(And when it does race, we win by skipping the atomic and the list
manipulation.)
There's some benefit in having little reminders that many paths lead to
dma_fence_signal.
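(To spell out why losing that race is harmless: the early test_bit() is
just a cheap filter; the authoritative step is the atomic inside
dma_fence_signal() itself, which quietly does nothing if the fence is
already signaled. The pattern, not the exact code:)

	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &rq->fence.flags))
		continue;	/* retire (or another signaler) got there first */

	/* ... move rq onto the local list, drop the spinlock ... */

	dma_fence_signal(&rq->fence);	/* idempotent, so losing the race is fine */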
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index 3f9f0a5a0f44..d7c31ad1d0ea 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -486,8 +486,8 @@ static u32 *gen7_xcs_emit_breadcrumb(struct i915_request *rq, u32 *cs)
> >
> > for (i = 0; i < GEN7_XCS_WA; i++) {
> > *cs++ = MI_STORE_DWORD_INDEX;
> > - *cs++ = I915_GEM_HWS_INDEX_ADDR;
> > - *cs++ = rq->global_seqno;
> > + *cs++ = I915_GEM_HWS_SEQNO_ADDR;
> > + *cs++ = rq->fence.seqno;
>
> Why this, and in this patch?
A slip of the fingers. Spotted that last night as well and forgot to
move the chunk.
> > + for (n = 0; n < count; n++) {
> > + struct i915_gem_context *ctx =
> > + t->contexts[order[n] % t->ncontexts];
> > + struct i915_request *rq;
> > +
> > + mutex_lock(BKL);
> > +
> > + rq = t->request_alloc(ctx, t->engine);
> > + if (IS_ERR(rq)) {
> > + mutex_unlock(BKL);
> > + err = PTR_ERR(rq);
> > + count = n;
> > + break;
> > + }
> > +
> > + err = i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> > + submit,
> > + GFP_KERNEL);
>
> The submit fence is just there to ensure the request doesn't finish
> before you can enable signaling on it via the wait fence?
Yes. We put everything into a background queue, complete our setup, then
let the scheduler do its thing.
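Condensed from the hunk above, the shape of each loop iteration and its
release is just:

	/* gate execution of every request on a single submit fence */
	i915_sw_fence_await_sw_fence_gfp(&rq->submit, submit, GFP_KERNEL);

	/* and have the wait fence listen for each request's completion */
	i915_sw_fence_await_dma_fence(wait, &rq->fence, 0, GFP_KERNEL);

	/* once the whole batch is queued: */
	i915_sw_fence_commit(submit);	/* release every request at once */
	i915_sw_fence_commit(wait);	/* then block until they all signal */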
> > +
> > + requests[n] = i915_request_get(rq);
> > + i915_request_add(rq);
> > +
> > + mutex_unlock(BKL);
> > +
> > + if (err >= 0)
> > + err = i915_sw_fence_await_dma_fence(wait,
> > + &rq->fence,
> > + 0,
> > + GFP_KERNEL);
>
> If the above is true, why can't this simply be i915_request_enable_breadcrumbs?
I was trying to test breadcrumbs at a high level without 'cheating'.
For i915_request_enable_breadcrumbs, I keep thinking the test should be
asking if the engine then has it on its signaling list, but that feels
like a circular argument, directly following the implementation itself.
> > + if (err < 0) {
>
> else if?
Now where would I be able to copy'n'paste that from? ;)
> > + i915_request_put(rq);
> > + count = n;
> > + break;
> > + }
> > + }
> > +
> > + i915_sw_fence_commit(submit);
> > + i915_sw_fence_commit(wait);
> > +
> > + if (!wait_event_timeout(wait->wait,
> > + i915_sw_fence_done(wait),
> > + HZ / 2)) {
> > + struct i915_request *rq = requests[count - 1];
> > +
> > + pr_err("waiting for %d fences (last %llx:%lld) on %s timed out!\n",
> > + count,
> > + rq->fence.context, rq->fence.seqno,
> > + t->engine->name);
> > + i915_gem_set_wedged(t->engine->i915);
> > + GEM_BUG_ON(!i915_request_completed(rq));
> > + i915_sw_fence_wait(wait);
> > + err = -EIO;
> > + }
> > +
> > + for (n = 0; n < count; n++) {
> > + struct i915_request *rq = requests[n];
> > +
> > + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> > + &rq->fence.flags)) {
> > + pr_err("%llu:%llu was not signaled!\n",
> > + rq->fence.context, rq->fence.seqno);
> > + err = -EINVAL;
> > + }
> > +
> > + i915_request_put(rq);
> > + }
> > +
> > + heap_fence_put(wait);
> > + heap_fence_put(submit);
> > +
> > + if (err < 0)
> > + break;
> > +
> > + num_fences += count;
>
> Isn't num_fences actually also growing by one, while num_requests would
> be growing by count?
Each request is a fence (rq->fence); it's the requests/fences that we're
inspecting here, while the i915_sw_fence itself we take on trust.
> > + num_waits++;
> > +
> > + cond_resched();
>
> There is a wait_event_timeout in the loop so this shouldn't be needed.
> AFAICS if something fails the loop bails out.
Force of habit. Rather be overkill than have somebody moan about
RCU/khungtaskd stalls.
> > + for (n = 0; n < ncpus; n++) {
> > + threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
> > + &t, "igt/%d", n);
> > + if (IS_ERR(threads[n])) {
> > + ret = PTR_ERR(threads[n]);
> > + ncpus = n;
>
> I would be tempted to just fail the test rather than carry on,
> especially since it seems the selftest will report failure anyway as
> ret is preserved.
We have to tidy up; that's why we carry on, keeping the error state.
> > + break;
> > + }
> > +
> > + get_task_struct(threads[n]);
> > + }
>
> Thread creation looks like it can be done outside struct_mutex.
Could be.
> > +static int
> > +max_batches(struct i915_gem_context *ctx, struct intel_engine_cs *engine)
> > +{
> > + struct i915_request *rq;
> > + int ret;
> > +
> > + /*
> > + * Before execlists, all contexts share the same ringbuffer. With
> > + * execlists, each context/engine has a separate ringbuffer and
> > + * for the purposes of this test, inexhaustible.
> > + *
> > + * For the global ringbuffer though, we have to be very careful
> > + * that we do not wrap while preventing the execution of requests
> > + * with an unsignaled fence.
> > + */
> > + if (HAS_EXECLISTS(ctx->i915))
> > + return INT_MAX;
> > +
> > + rq = i915_request_alloc(engine, ctx);
> > + if (IS_ERR(rq)) {
> > + ret = PTR_ERR(rq);
> > + } else {
> > + int sz;
> > +
> > + ret = rq->ring->size - rq->reserved_space;
> > + i915_request_add(rq);
> > +
> > + sz = rq->ring->emit - rq->head;
> > + if (sz < 0)
> > + sz += rq->ring->size;
> > + ret /= sz;
> > + ret /= 2; /* leave half spare, in case of emergency! */
> > +
> > + /* One ring interleaved between requests from all cpus */
> > + ret /= num_online_cpus() + 1;
>
> It would probably be more robust for any potential future users to
> leave the num_cpus division to the caller.
But then I would only have INT_MAX/ncpus for execlists!!! :)
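For a feel of the numbers (values purely illustrative, not measured):
with a 16KiB legacy ring, ~200 bytes of reserved space, ~256 bytes
emitted per request and 4 cpus, the arithmetic above gives

	max_batch = ((16384 - 200) / 256) / 2 / (4 + 1)
	          = 63 / 2 / 5	(integer division)
	          = 6

i.e. each thread limits itself to roughly half a dozen requests in
flight, leaving plenty of slack in the shared ring.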
> > +static int live_breadcrumbs_smoketest(void *arg)
> > +{
> > + struct drm_i915_private *i915 = arg;
> > + struct smoketest t[I915_NUM_ENGINES];
> > + unsigned int ncpus = num_online_cpus();
> > + unsigned long num_waits, num_fences;
> > + struct intel_engine_cs *engine;
> > + struct task_struct **threads;
> > + struct igt_live_test live;
> > + enum intel_engine_id id;
> > + intel_wakeref_t wakeref;
> > + struct drm_file *file;
> > + unsigned int n;
> > + int ret = 0;
> > +
> > + /*
> > + * Smoketest our breadcrumb/signal handling for requests across multiple
> > + * threads. A very simple test to only catch the most egregious of bugs.
> > + * See __igt_breadcrumbs_smoketest();
> > + *
> > + * On real hardware this time.
> > + */
> > +
> > + wakeref = intel_runtime_pm_get(i915);
> > +
> > + file = mock_file(i915);
> > + if (IS_ERR(file)) {
> > + ret = PTR_ERR(file);
> > + goto out_rpm;
> > + }
> > +
> > + threads = kcalloc(ncpus * I915_NUM_ENGINES,
> > + sizeof(*threads),
> > + GFP_KERNEL);
> > + if (!threads)
> > + return -ENOMEM;
>
> goto out_mock_file; to avoid two leaks.
>
> > +
> > + memset(&t[0], 0, sizeof(t[0]));
> > + t[0].request_alloc = __live_request_alloc;
> > + t[0].ncontexts = 64;
> > + t[0].contexts = kmalloc_array(t[0].ncontexts,
> > + sizeof(*t[0].contexts),
> > + GFP_KERNEL);
> > + if (!t[0].contexts) {
> > + ret = -ENOMEM;
> > + goto out_threads;
> > + }
> > +
> > + mutex_lock(&i915->drm.struct_mutex);
> > + for (n = 0; n < t[0].ncontexts; n++) {
> > + t[0].contexts[n] = live_context(i915, file);
> > + if (!t[0].contexts[n]) {
> > + ret = -ENOMEM;
> > + goto out_contexts;
>
> Context list unwind is missing.
The magic of live_context() is that the contexts are added to the file
and released when the file is freed, if not earlier.
> > + ret = igt_live_test_begin(&live, i915, __func__, "");
> > + if (ret)
> > + goto out_contexts;
> > +
> > + for_each_engine(engine, i915, id) {
> > + t[id] = t[0];
> > + t[id].engine = engine;
> > + t[id].max_batch = max_batches(t[0].contexts[0], engine);
> > + if (t[id].max_batch < 0) {
> > + ret = t[id].max_batch;
> > + goto out_flush;
> > + }
> > + pr_debug("Limiting batches to %d requests on %s\n",
> > + t[id].max_batch, engine->name);
> > +
> > + for (n = 0; n < ncpus; n++) {
> > + struct task_struct *tsk;
> > +
> > + tsk = kthread_run(__igt_breadcrumbs_smoketest,
> > + &t[id], "igt/%d.%d", id, n);
> > + if (IS_ERR(tsk)) {
> > + ret = PTR_ERR(tsk);
> > + goto out_flush;
> > + }
> > +
> > + get_task_struct(tsk);
> > + threads[id * ncpus + n] = tsk;
> > + }
> > + }
> > + mutex_unlock(&i915->drm.struct_mutex);
>
> I think thread setup could be done outside struct_mutex in two passes,
> if only the max_batches calculation is done first under it.
Then do it as one pass, since having a thread start a few microseconds
earlier is immaterial in the grand scheme of things.
> > + msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
> > +
> > +out_flush:
> > + num_waits = 0;
> > + num_fences = 0;
> > + for_each_engine(engine, i915, id) {
> > + for (n = 0; n < ncpus; n++) {
> > + struct task_struct *tsk = threads[id * ncpus + n];
> > + int err;
> > +
> > + if (!tsk)
> > + continue;
> > +
> > + err = kthread_stop(tsk);
> > + if (err < 0 && !ret)
> > + ret = err;
> > +
> > + put_task_struct(tsk);
> > + }
> > +
> > + num_waits += atomic_long_read(&t[id].num_waits);
> > + num_fences += atomic_long_read(&t[id].num_fences);
> > + }
> > + pr_info("Completed %lu waits for %lu fence across %d engines and %d cpus\n",
>
> fences plural
But it might only be 1! ;)
-Chris