[Intel-gfx] [PATCH 08/11] drm/i915/execlists: Force preemption via reset on timeout

Wed Mar 28 08:59:09 UTC 2018

Quoting Chris Wilson (2018-03-27 10:23:26)
> Quoting Tvrtko Ursulin (2018-03-27 09:51:20)
> > 
> > On 26/03/2018 12:50, Chris Wilson wrote:
> > > +static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
> > > +{
> > > +     struct intel_engine_execlists *execlists =
> > > +             container_of(hrtimer, typeof(*execlists), preempt_timer);
> > > +
> > > +     GEM_TRACE("%s\n",
> > > +               container_of(execlists,
> > > +                            struct intel_engine_cs,
> > > +                            execlists)->name);
> > > +
> > > +     queue_work(system_highpri_wq, &execlists->preempt_reset);
> > 
> > I suppose indirection from hrtimer to worker is for better than jiffie 
> > timeout granularity? But then queue_work might introduce some delay to 
> > defeat that.
> 
> Yes. It's betting on highpri_wq being just that. We can do better with
> our own RT kthread and a wakeup from the hrtimer if required.
> 
> Hmm, the original plan (watchdog TDR) was to avoid using the global
> reset entirely. The per-engine reset (although needs serialising with
> itself) doesn't need struct_mutex, so we should be able to do that from
> the timer directly, and just kick off the global reset on failure.
> 
> That sounds a whole lot better, let's dust off that code and see what
> breaks.

So I think something along the lines of

+static int try_execlists_reset(struct intel_engine_execlists *execlists)
+{
+       struct intel_engine_cs *engine =
+               container_of(execlists, typeof(*engine), execlists);
+       int err = -EBUSY;
+
+       if (!test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted) &&
+           tasklet_trylock(&execlists->tasklet)) {
+               unsigned long *lock = &engine->i915->gpu_error.flags;
+               unsigned int bit = I915_RESET_ENGINE + engine->id;
+
+               if (!test_and_set_bit(bit, lock)) {
+                       tasklet_disable(&engine->execlists.tasklet);
+                       err = i915_reset_engine(engine,
+                                               "preemption timeout");
+                       tasklet_enable(&engine->execlists.tasklet);
+
+                       clear_bit(bit, lock);
+                       wake_up_bit(lock, bit);
+               }
+
+               tasklet_unlock(&execlists->tasklet);
+       }
+
+       return err;
+}

should do the trick, with a fallback to worker+i915_handle_error for the
cases where we can't claim ensure softirq safety.

There's still the issue of two resets in quick succession being treated
as a failure. That's also an issue for the current per-engine failover
to whole-device reset.
-Chris