[Mesa-dev] [PATCH] RFC: Extend IMG_context_priority with NV_context_priority_realtime

Sat Mar 31 11:00:16 UTC 2018

Quoting Kenneth Graunke (2018-03-30 19:20:57)
> On Friday, March 30, 2018 7:40:13 AM PDT Chris Wilson wrote:
> > For i915, we are proposing to use a quality-of-service parameter in
> > addition to that of just a priority that usurps everyone. Due to our HW,
> > preemption may not be immediate and will be forced to wait until an
> > uncooperative process hits an arbitration point. To prevent that unduly
> > impacting the privileged RealTime context, we back up the preemption
> > request with a timeout to reset the GPU and forcibly evict the GPU hog
> > in order to execute the new context.
> 
> I am strongly against exposing this in general.  Performing a GPU reset
> in the middle of a batch can completely screw up whatever application
> was running.  If the application is using robustness extensions, we may
> be forced to return GL_DEVICE_LOST, causing the application to have to
> recreate their entire GL context and start over.  If not, we may try to
> let them limp on(*) - and hope they didn't get too badly damaged by some
> of their commands not executing, or executing twice (if the kernel tries
> to resubmit it).  But it may very well cause the app to misrender, or
> even crash.

Yes, I think the revulsion has been universal. However, as a
quality-of-service guarantee, I can understand the appeal. The
difference is that instead of tolerating a DoS for the 6s or so that we
currently allow, we let the context specify that limit. As it does
allow one context to impact another, I want it locked down to
privileged processes. I have been using CAP_SYS_ADMIN, as the potential
to do harm is even greater than exploiting the weak scheduler by
changing priority.
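
To make the policy concrete, here is a minimal sketch of the idea in
plain C. It is only an illustration, not the i915 implementation: the
struct, the function names, the "privileged" flag (standing in for a
CAP_SYS_ADMIN check) and the exact 6000ms default are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default: roughly the 6s a hog can stall the GPU today. */
#define DEFAULT_HANG_TIMEOUT_MS 6000u

/* Hypothetical per-context state; not the real i915 structures. */
struct ctx {
	bool privileged;             /* stands in for a CAP_SYS_ADMIN check */
	uint32_t preempt_timeout_ms; /* 0 means "use the default" */
};

/* Only a privileged context may shorten the wait before a forced reset. */
static int ctx_set_preempt_timeout(struct ctx *c, uint32_t ms)
{
	assert(c != NULL);
	if (!c->privileged)
		return -1; /* a kernel would return -EPERM here */
	c->preempt_timeout_ms = ms;
	return 0;
}

/* How long a hog may ignore a preemption request made by 'waiter'. */
static uint32_t effective_timeout_ms(const struct ctx *waiter)
{
	return waiter->preempt_timeout_ms ? waiter->preempt_timeout_ms
					  : DEFAULT_HANG_TIMEOUT_MS;
}

/* True once the hog has overstayed and should be evicted by a reset. */
static bool should_reset(const struct ctx *waiter, uint32_t waited_ms)
{
	return waited_ms >= effective_timeout_ms(waiter);
}
```

In this sketch an unprivileged context can never tighten the deadline,
so it is still bounded only by the default, while a privileged context
that asks for 10ms would trigger the reset after 10ms of waiting.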

> This seems like a crazy plan to me.  Scheduling has never been allowed
> to just kill random processes.

That's not strictly true: processes have limits, and if they exceed
them they will be killed. On the CPU, preemption is much better; the
issue of unyielding processes is pretty much limited to the kernel, where
we can run the NMI watchdog to kill broken code.

> If you ever hit that case, then your
> customers will see random application crashes, glitches, GPU hangs,
> and be pretty unhappy with the result.  And not because something was
> broken, but because somebody was impatient and an app was a bit slow.

Yes, that is their decision. Kill random apps so that their
uber-critical interface updates the clock.

> If you have work that is so mission critical, maybe you shouldn't run it
> on the same machine as one that runs applications which you care so
> little about that you're willing to watch them crash and burn.  Don't
> run the entertainment system on the flight computer, so to speak.

You are not the first to say that ;)
-Chris