[Mesa-dev] [PATCH] RFC: Extend IMG_context_priority with NV_context_priority_realtime
Chris Wilson
chris at chris-wilson.co.uk
Sat Mar 31 19:59:18 UTC 2018
Quoting Kenneth Graunke (2018-03-31 20:29:28)
> On Saturday, March 31, 2018 5:56:57 AM PDT Chris Wilson wrote:
> > Quoting Chris Wilson (2018-03-31 12:00:16)
> > > Quoting Kenneth Graunke (2018-03-30 19:20:57)
> > > > On Friday, March 30, 2018 7:40:13 AM PDT Chris Wilson wrote:
> > > > > For i915, we are proposing to use a quality-of-service parameter in
> > > > > addition to that of just a priority that usurps everyone. Due to our HW,
> > > > > preemption may not be immediate and will be forced to wait until an
> > > > > uncooperative process hits an arbitration point. To prevent that unduly
> > > > > impacting the privileged RealTime context, we back up the preemption
> > > > > request with a timeout to reset the GPU and forcibly evict the GPU hog
> > > > > in order to execute the new context.
> > > >
> > > > I am strongly against exposing this in general. Performing a GPU reset
> > > > in the middle of a batch can completely screw up whatever application
> > > > was running. If the application is using robustness extensions, we may
> > > > be forced to return GL_DEVICE_LOST, causing the application to have to
> > > > recreate their entire GL context and start over. If not, we may try to
> > > > let them limp on(*) - and hope they didn't get too badly damaged by some
> > > > of their commands not executing, or executing twice (if the kernel tries
> > > > to resubmit it). But it may very well cause the app to misrender, or
> > > > even crash.
> > >
> > > Yes, I think the revulsion has been universal. However, as a
> > > quality-of-service guarantee, I can understand the appeal. The
> > > difference is that instead of allowing a DoS for 6s or so as we
> > > currently allow, we allow that to be specified by the context. As it
> > > does allow one context to impact another, I want it locked down to
> > > privileged processes. I have been using CAP_SYS_ADMIN as the potential
> > > to do harm is even greater than exploiting the weak scheduler by
> > > changing priority.
>
> Right...I was thinking perhaps a tunable to reduce the 6s would do the
> trick, and be much less complicated...but perhaps you want to let it go
> longer when there isn't super-critical work to do.
If (mid-object) preemption worked properly, we wouldn't see many GPU
hangs at all, depending on how free the compositor is to inject work. Oh
boy, that suggests we need to rethink the current hangcheck.
Bring on timeslicing.
> > Also to add further insult to injury, we might want to force GPU clocks
> > to max for the RT context (so that the context starts executing at max
> > rather than wait for the system to upclock on load). Something like,
>
> That makes some sense - but I wonder if it wouldn't cause more battery
> burn than is necessary. The super-critical workload may also be
> relatively simple (redrawing a clock), and so up-clocking and
> down-clocking again might hurt us...it's hard to say. :(
>
> I also don't know what I think of this plan to let userspace control
> (restrict) the frequency. That's been restricted to root (via sysfs)
> in the past. But I think you're allowing it more generally now, without
> CAP_SYS_ADMIN? It seems like there's a lot of potential for abuse.
> (Hello, benchmark mode! Zoooom!) I know it solves a problem, but it
> seems like there's got to be a better way...
It's restricting the range the system can choose, but only within the
range the sysadmin defines. The expected use case for me is actually
HTPC more than benchmark mode (which benchmark needs max clocks but
doesn't already run at them?): a workload that you know needs a narrow
band of frequencies, where you want to conserve energy by not
overclocking and also have a good idea of the minimum required to avoid
frame drops. Tricking the system into running at high clocks isn't that
hard today.
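To make "only within the range the sysadmin defines" concrete, here is a
rough sketch of the clamping I have in mind. The struct and function
names are made up for illustration and are not the actual i915 uAPI or
sysfs interface; the handling of an inverted request is likewise just an
assumption.

```c
#include <stdint.h>

/* Hypothetical type for illustration only; not part of any real uAPI. */
struct freq_range {
    uint32_t min_mhz;
    uint32_t max_mhz;
};

/* Clamp v into [lo, hi]; assumes lo <= hi. */
uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Narrow the sysadmin's range by what a context asks for. The result
 * never extends outside the sysadmin's limits, so a context can only
 * restrict the band the system picks from, never widen it.
 */
struct freq_range effective_range(struct freq_range admin,
                                  struct freq_range ctx)
{
    struct freq_range out;

    out.min_mhz = clamp_u32(ctx.min_mhz, admin.min_mhz, admin.max_mhz);
    out.max_mhz = clamp_u32(ctx.max_mhz, admin.min_mhz, admin.max_mhz);

    /* Assumption: an inverted request collapses onto its minimum. */
    if (out.max_mhz < out.min_mhz)
        out.max_mhz = out.min_mhz;

    return out;
}
```

With a sysadmin range of 300-1100 MHz, an HTPC-style request for 450-600
MHz is honoured as is, whereas a request for 100-2000 MHz is pulled back
to 300-1100 MHz.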
It just happens that historically RT processes force max CPU clocks, and
for something that demands a low latency QoS I expect to also have low
latency tolerance throughout the pipeline.
-Chris