[Intel-gfx] [PATCH 08/11] drm/i915/execlists: Force preemption via reset on timeout
Chris Wilson
chris at chris-wilson.co.uk
Wed Mar 28 08:59:09 UTC 2018
Quoting Chris Wilson (2018-03-27 10:23:26)
> Quoting Tvrtko Ursulin (2018-03-27 09:51:20)
> >
> > On 26/03/2018 12:50, Chris Wilson wrote:
> > > +static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
> > > +{
> > > + struct intel_engine_execlists *execlists =
> > > + container_of(hrtimer, typeof(*execlists), preempt_timer);
> > > +
> > > + GEM_TRACE("%s\n",
> > > + container_of(execlists,
> > > + struct intel_engine_cs,
> > > + execlists)->name);
> > > +
> > > + queue_work(system_highpri_wq, &execlists->preempt_reset);
> >
> > I suppose indirection from hrtimer to worker is for better than jiffie
> > timeout granularity? But then queue_work might introduce some delay to
> > defeat that.
>
> Yes. It's betting on highpri_wq being just that. We can do better with
> our own RT kthread and a wakeup from the hrtimer if required.
>
> Hmm, the original plan (watchdog TDR) was to avoid using the global
> reset entirely. The per-engine reset (although needs serialising with
> itself) doesn't need struct_mutex, so we should be able to do that from
> the timer directly, and just kick off the global reset on failure.
>
> That sounds a whole lot better, let's dust off that code and see what
> breaks.
So I think something along the lines of
+static int try_execlists_reset(struct intel_engine_execlists *execlists)
+{
+ struct intel_engine_cs *engine =
+ container_of(execlists, typeof(*engine), execlists);
+ int err = -EBUSY;
+
+ if (!test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted) &&
+ tasklet_trylock(&execlists->tasklet)) {
+ unsigned long *lock = &engine->i915->gpu_error.flags;
+ unsigned int bit = I915_RESET_ENGINE + engine->id;
+
+ if (!test_and_set_bit(bit, lock)) {
+ tasklet_disable(&engine->execlists.tasklet);
+ err = i915_reset_engine(engine,
+ "preemption timeout");
+ tasklet_enable(&engine->execlists.tasklet);
+
+ clear_bit(bit, lock);
+ wake_up_bit(lock, bit);
+ }
+
+ tasklet_unlock(&execlists->tasklet);
+ }
+
+ return err;
+}
should do the trick, with a fallback to worker+i915_handle_error for the
cases where we can't claim ensure softirq safety.
There's still the issue of two resets in quick succession being treated
as a failure. That's also an issue for the current per-engine failover
to whole-device reset.
-Chris
More information about the Intel-gfx
mailing list