[Mesa-dev] [PATCH] RFC: Extend IMG_context_priority with NV_context_priority_realtime

Fri Mar 30 18:20:57 UTC 2018

On Friday, March 30, 2018 7:40:13 AM PDT Chris Wilson wrote:
> NV_context_priority_realtime
> https://www.khronos.org/registry/EGL/extensions/NV/EGL_NV_context_priority_realtime.txt
> 
>     "This extension allows an EGLContext to be created with one extra
>     priority level in addition to three priority levels that are part of
>     EGL_IMG_context_priority extension.
> 
>     This new level has extra privileges that are not available to other three
>     levels. Some of the privileges may include:
>     - Allow realtime priority to only few contexts
>     - Allow realtime priority only to trusted applications
>     - Make sure realtime priority contexts are executed immediately
>     - Preempt any current context running on GPU on submission of
>       commands for realtime context"
> 
> At its most basic, it just adds an extra enum and level into the existing
> context priority framework.
> 
> For i915, we are proposing to use a quality-of-service parameter in
> addition to that of just a priority that usurps everyone. Due to our HW,
> preemption may not be immediate and will be forced to wait until an
> uncooperative process hits an arbitration point. To prevent that unduly
> impacting the privileged RealTime context, we back up the preemption
> request with a timeout to reset the GPU and forcibly evict the GPU hog
> in order to execute the new context.
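
For reference, the "extra enum" mentioned above amounts to one more value
accepted by the existing EGL_CONTEXT_PRIORITY_LEVEL_IMG context attribute.
The following is a minimal sketch of building such an attribute list. It
assumes the token values match those published in the Khronos eglext.h
(they are redefined locally only so the sketch compiles without EGL
installed, and should be checked against the real header), and
build_priority_attribs() is a hypothetical helper, not part of EGL or Mesa.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: values copied from the Khronos eglext.h; verify against the
 * real header before relying on them. */
#define EGL_NONE                          0x3038
#define EGL_CONTEXT_PRIORITY_LEVEL_IMG    0x3100
#define EGL_CONTEXT_PRIORITY_HIGH_IMG     0x3101
#define EGL_CONTEXT_PRIORITY_MEDIUM_IMG   0x3102
#define EGL_CONTEXT_PRIORITY_LOW_IMG      0x3103
#define EGL_CONTEXT_PRIORITY_REALTIME_NV  0x3357

typedef int32_t EGLint;

/* Hypothetical helper: write a context attribute list requesting `level`
 * into attribs[]. Returns the number of entries written (including the
 * EGL_NONE terminator), or 0 if the buffer is too small. */
static size_t build_priority_attribs(EGLint level, EGLint *attribs, size_t cap)
{
    assert(attribs != NULL);
    if (cap < 3)
        return 0;
    attribs[0] = EGL_CONTEXT_PRIORITY_LEVEL_IMG;
    attribs[1] = level;
    attribs[2] = EGL_NONE;
    return 3;
}
```

In a real program the resulting array would be passed as the attrib_list
argument of eglCreateContext(). Under EGL_IMG_context_priority the
requested level is only a hint, so the granted level should be read back
with eglQueryContext() rather than assumed.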

I am strongly against exposing this in general.  Performing a GPU reset
in the middle of a batch can completely screw up whatever application
was running.  If the application is using robustness extensions, we may
be forced to return GL_DEVICE_LOST, causing the application to have to
recreate their entire GL context and start over.  If not, we may try to
let them limp on(*) - and hope they didn't get too badly damaged by some
of their commands not executing, or executing twice (if the kernel tries
to resubmit it).  But it may very well cause the app to misrender, or
even crash.
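
To make the robustness case concrete: an application that opts into
robustness learns about a reset by polling glGetGraphicsResetStatus(),
which reports a guilty, innocent, or unknown context reset. The sketch
below only maps that status to a recovery decision. It assumes the token
values from the Khronos GL headers (redefined locally so it builds without
GL), and action_for_reset_status() and the recovery_action enum are
hypothetical names used for illustration.

```c
#include <assert.h>

/* Assumption: values copied from the Khronos GL headers; verify against
 * the real headers before relying on them. */
#define GL_NO_ERROR                0
#define GL_GUILTY_CONTEXT_RESET    0x8253
#define GL_INNOCENT_CONTEXT_RESET  0x8254
#define GL_UNKNOWN_CONTEXT_RESET   0x8255

typedef unsigned int GLenum;

/* Hypothetical: what the application does after polling the reset status. */
enum recovery_action {
    KEEP_RENDERING,    /* no reset observed */
    RECREATE_CONTEXT   /* context is lost; rebuild all GL state */
};

/* Map a reset status to a recovery action. Any reset, whether or not this
 * context caused it, leaves the context unusable, so every non-zero status
 * leads to a full context re-creation. */
static enum recovery_action action_for_reset_status(GLenum status)
{
    switch (status) {
    case GL_NO_ERROR:
        return KEEP_RENDERING;
    case GL_GUILTY_CONTEXT_RESET:
    case GL_INNOCENT_CONTEXT_RESET:
    case GL_UNKNOWN_CONTEXT_RESET:
    default:
        return RECREATE_CONTEXT;
    }
}

/* Usage example: the decisions an application would reach per status. */
static void self_check(void)
{
    assert(action_for_reset_status(GL_NO_ERROR) == KEEP_RENDERING);
    assert(action_for_reset_status(GL_GUILTY_CONTEXT_RESET) == RECREATE_CONTEXT);
}
```

An application would call this once per frame with the value returned by
glGetGraphicsResetStatus(). The point is that even a well-behaved robust
application pays for a forced reset with a full context rebuild.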

This seems like a crazy plan to me.  Scheduling has never been allowed
to just kill random processes.  If you ever hit that case, then your
customers will see random application crashes, glitches, GPU hangs,
and be pretty unhappy with the result.  And not because something was
broken, but because somebody was impatient and an app was a bit slow.

If you have work that is so mission critical, maybe you shouldn't run it
on the same machine as one that runs applications which you care so
little about that you're willing to watch them crash and burn.  Don't
run the entertainment system on the flight computer, so to speak.

At any rate, I suspect you wouldn't go down this path unless you
absolutely had to, and there was some incredible forcing function at
play.  Which is why I said "against exposing this in general".  Maybe
you have a customer that's willing to play with fire.  I just wanted
to make it abundantly clear that this is hazardous.

--Ken

(*) We don't actually let things limp along after a bad hang today.  If
    execbuf fails, we just exit(1) and let it crash and burn.  We really
    should fix that (but I need to fix some state tracking bugs first).