[Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

Tue Oct 29 18:02:52 UTC 2019

Quoting Jason Ekstrand (2019-10-29 16:19:09)
> 
> 
> On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson <chris at chris-wilson.co.uk> wrote:
> 
>     Quoting Jason Ekstrand (2019-10-25 19:22:04)
>     > On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris at chris-wilson.co.uk> wrote:
>     >
>     >     Our existing behaviour is to allow contexts and their GPU requests to
>     >     persist past the point of closure until the requests are complete. This
>     >     allows clients to operate in a 'fire-and-forget' manner where they can
>     >     set up a rendering pipeline, hand it over to the display server, and
>     >     immediately exit. As the rendering pipeline is kept alive until
>     >     completion, the display server (or other consumer) can use the results
>     >     in the future and present them to the user.
>     >
>     >     However, not all clients want this persistent behaviour and would prefer
>     >     that the contexts are cleaned up immediately upon closure. This ensures
>     >     that when clients are run without hangchecking, any GPU hang is
>     >     terminated with the process and does not continue to hog resources.
>     >
>     >     By defining a context property to allow clients to control persistence
>     >     explicitly, we can remove the blanket advice to disable hangchecking
>     >     that seems to be far too prevalent.
>     >
>     >
>     > Just to be clear, when you say "disable hangchecking" do you mean disabling it
>     > for all processes via a kernel parameter at boot time or a sysfs entry or
>     > similar?  Or is there some mechanism whereby a context can request no hang
>     > checking?
> 
>     They are being told to use the module parameter i915.enable_hangcheck=0
>     to globally disable hangchecking. This is what we are trying to wean
>     them off, and yet still allow indefinitely long kernels. The softer
>     hangcheck is focused on if you block scheduling or preemption of higher
>     priority work, then you are forcibly removed from the GPU. However, even
>     that is too much for some workloads, where they really do expect to
>     permanently hog the GPU. (All I can say is that they better be dedicated
>     systems because if you demand interactivity on top of disabling
>     preemption...)
> 
> 
> Ok, thinking out loud here (no need for you to respond):  Why should we take
> this approach?  It seems like there are several other ways we could solve this:
> 
>  1. Have a per-context flag (that's what we did here)
>  2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
> process".
>  3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill the
> context
> 
> Option 2 seems like a lot more work in i915 and it doesn't seem that
> advantageous.  Most drivers are going to either want their batches to outlive
> them or not; they aren't going to be making that decision on a per-batch basis.

And processing each batch on context close, deciding how to carry
forward the different options, doesn't sound like fun.

> Option 3 would work for some cases but it doesn't let the kernel terminate work
> if the client is killed unexpectedly by, for instance, a segfault.  The client
> could try to insert a crash handler but calling a DRM ioctl from a crash
> handler sounds like a bad plan.  On the other hand, the client can just as
> easily implement 3 by setting the new context flag and then calling
> GEM_CONTEXT_DESTROY.

Exactly. Abnormal process termination is the name of the game.

> With that, I think I'm convinced that a context param is the best way to do
> this.  We may even consider using it in Vulkan when running headless to let us
> kill stuff quicker.  We aren't seeing any long-running Vulkan compute workloads
> yet but they may be coming.

Yeah, might it find a use for GL robustness? If the app wants a
guarantee that its resources are garbage collected along with it?

> Acked-by: Jason Ekstrand <jason at jlekstrand.net>
> 
> 
> One more question: Does this bit fully support being turned on and off or is it
> a set-once?  I ask because how I'd likely go about doing this in Vulkan would
> be to set it on context create and then unset it the moment we see a buffer
> shared with the outside world.

You can set it as many times as you like; it only takes effect on
context termination. That comes back to whether you want to evaluate it
as a batch/execbuf attribute or on the context. The starting point for
the patches was to close the process termination hole, of which this is
the natural extension for a context to opt into being cancelled on
closure. I think an all-or-nothing makes sense, or rather I don't see
enough advantage in the per-batch attribute to justify the processing.
-Chris
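
To make the semantics discussed above concrete, here is a minimal user-space
sketch in C of a per-context persistence flag that is only evaluated when the
context is closed. It is a toy model, not i915 code: every name in it
(toy_context, toy_set_persistence, toy_close and so on) is hypothetical, and
the real driver's data structures and cancellation path are assumed to be far
more involved. It only illustrates that the flag can be flipped any number of
times, that the last value wins, and that the decision is all-or-nothing for
the context rather than per batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; every name below is hypothetical, not i915 API. */
enum toy_req_state { TOY_REQ_PENDING, TOY_REQ_CANCELLED };

#define TOY_MAX_REQS 8

struct toy_context {
	bool persistent;  /* true by default, mirroring the existing behaviour */
	bool closed;
	size_t nreqs;
	enum toy_req_state reqs[TOY_MAX_REQS];
};

static void toy_context_init(struct toy_context *ctx)
{
	ctx->persistent = true;
	ctx->closed = false;
	ctx->nreqs = 0;
}

/* May be called any number of times; only the last value matters. */
static void toy_set_persistence(struct toy_context *ctx, bool on)
{
	assert(!ctx->closed);
	ctx->persistent = on;
}

/* Queue one request; returns its index, or -1 if the toy queue is full. */
static int toy_submit(struct toy_context *ctx)
{
	assert(!ctx->closed);
	if (ctx->nreqs == TOY_MAX_REQS)
		return -1;
	ctx->reqs[ctx->nreqs] = TOY_REQ_PENDING;
	return (int)ctx->nreqs++;
}

/* The flag is evaluated once, here, for the context as a whole. */
static void toy_close(struct toy_context *ctx)
{
	ctx->closed = true;
	if (ctx->persistent)
		return; /* outstanding requests are left to run to completion */
	for (size_t i = 0; i < ctx->nreqs; i++)
		ctx->reqs[i] = TOY_REQ_CANCELLED;
}

/* Count requests currently in state s. */
static size_t toy_count(const struct toy_context *ctx, enum toy_req_state s)
{
	size_t n = 0;
	for (size_t i = 0; i < ctx->nreqs; i++)
		if (ctx->reqs[i] == s)
			n++;
	return n;
}
```

In this model, clearing the flag and then closing the context behaves like the
manual "kill context" of option 3, while a context left at the default keeps
its pending requests after close.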