[Intel-gfx] [RFC] drm/i915: Don't reset on preemptible workloads

Chris Wilson chris at chris-wilson.co.uk
Thu Aug 2 09:14:30 UTC 2018


Quoting Bartminski, Jakub (2018-08-02 09:56:10)
> On Wed, 2018-08-01 at 15:22 +0100, Chris Wilson wrote:
> > Quoting Jakub BartmiĊ„ski (2018-08-01 14:56:11)
> > > [...]
> > > +       /* We have already tried preempting, but the hardware did
> > > not react */
> > > +       if (engine->hangcheck.try_preempt)
> > > +               return false;
> > > +
> > > +       active_request = i915_gem_find_active_request(engine);
> > 
> > No ownership here of rq. You either need a new function to return a
> > reference, or careful application of rcu.
> 
> This is a fair point, I'm assuming without taking ownership here we may
> end up with a nasty race?
> 
> > > [...]
> > > +               /*
> > > +                * If the workload is preemptible but its context
> > > was closed
> > > +                * we force the engine to skip its execution
> > > instead.
> > > +                */
> > > +               if (i915_gem_context_is_closed(active_context))
> > > +                       skip_seqno = true;
> > 
> > Nope. Even if the context is closed, its results may be depended upon
> > by others; both GPU and userspace.
> 
> While there may be dependencies, we currently break these dependencies
> anyway by resetting the entire engine, don't we? Couldn't we signal an
> error for the skipped requests as if we reseted the engine and continue
> on with the execution of the remaining requests?

That's what we do currently, but we don't bother propagating it past the
immediate request. I've played with that in the past, especially
important for propagating fatal errors from async workers.

> If we don't reset/skip
> we are left with a request that potentially spins forever, with blocked
> dependent requests, and since it's basically the halting problem we
> have to draw the line somewhere.

Exactly. Why bother as we could just reset the engine within a few
microseconds.

> > [...]
> > So the gist of this is to excuse lowest priority user contexts from
> > hangcheck, do we want to go one step further and ask userspace to opt
> > in explicitly? 
> 
> Making it explicit opt-in was the idea here, doing it by setting the
> lowest priority was just a rough way of accomplishing that.
> 
> > Though the argument that as long as its preemptible (and isolated) it
> > is not acting as DoS so we don't need to enforce a timeout. That also
> > suggests that we need not reset until the request is blocking others
> > or userspace.
> 
> So if I'm understanding correctly you are suggesting that if we have
> preemption enabled we could drop the hangcheck entirely, and just reset
> if preemption fails?

Yes, but we still need the reset if we are stuck at the head of a chain.
 
> > As discussed that seems reasonable, but I think this does need to
> > actually trigger preemption.
> > -Chris
> 
> I'm not sure what you mean right here by "actually trigger preemption",
> since this patch does trigger the preemption, could you clarify?

I glossed over the direct call as that is not the way it should be done
:-p

> As I've mentioned before, one other problem with this patch is that we
> can't have more than one "background" request working before the first
> one finishes. While this could be solved at hangcheck by putting the
> background requests in the back of the background-priority queue
> instead of at the front during preemption, this breaks if the
> background requests are dependent on each other. These dependencies
> could be resolved in the tasklet (only in the case hangcheck was called
> so it wouldn't slow us down too much, they could also be resolved in
> the hangcheck itself but the timeline could change in between the
> hangcheck and the tasklet, which would make doing it correctly more
> complicated I think).

I think the very simple rule that you reset if there are any
dependencies and demote otherwise covers that.
-Chris


More information about the Intel-gfx mailing list