[Intel-gfx] [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close

Wed Aug 7 13:22:32 UTC 2019

Quoting Chris Wilson (2019-08-06 14:47:25)
> @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
>  
>         i915_gem_context_set_bannable(ctx);
>         i915_gem_context_set_recoverable(ctx);
> +       if (i915_modparams.enable_hangcheck)
> +               i915_gem_context_set_persistence(ctx);

I am not fond of this, but from a pragmatic point of view, this does
prevent the issue Jon raised: HW resources being pinned indefinitely
past process termination.

I don't like it because we cannot perform the operation cleanly
everywhere: it requires preemption first and foremost (with a cooperating
submission backend) and per-engine reset. The alternative is to try a
full GPU reset if the context is still active. For the sake of the
issue raised, I think that (full reset on older HW) is required.
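
As a rough sketch of that fallback, the decision on close could look
something like the following. Every name here (hw_caps, ctx_state,
close_policy and the enum) is hypothetical and made up for illustration;
none of it is existing i915 code, and it only models the policy, not the
actual preemption or reset.

```c
#include <stdbool.h>

/* Hypothetical capability flags; these are NOT real i915 structures. */
struct hw_caps {
	bool has_preemption;       /* submission backend can preempt a running context */
	bool has_per_engine_reset; /* a single engine can be reset in isolation */
};

/* Hypothetical per-context state at the time the context is closed. */
struct ctx_state {
	bool persistent; /* context is allowed to outlive its owner */
	bool active;     /* context still has work executing on the HW */
};

enum close_action {
	CLOSE_NOTHING,      /* persistent or already idle: leave it alone */
	CLOSE_ENGINE_RESET, /* preempt, then reset only the engine it occupies */
	CLOSE_FULL_RESET,   /* fallback where the clean path is unavailable */
};

/* Decide how to cancel a context when it is closed. */
static enum close_action close_policy(const struct hw_caps *hw,
				      const struct ctx_state *ctx)
{
	/* Nothing to cancel if the context may persist or has no work left. */
	if (ctx->persistent || !ctx->active)
		return CLOSE_NOTHING;

	/* The clean path needs both preemption and per-engine reset. */
	if (hw->has_preemption && hw->has_per_engine_reset)
		return CLOSE_ENGINE_RESET;

	/* Older HW: only a full GPU reset can evict the context. */
	return CLOSE_FULL_RESET;
}
```

The point of the last branch is that a non-persistent, still-active
context is never left running just because the HW lacks the clean path.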

That we are baking in a change of ABI due to an unsafe modparam is meh.
There are a few more corner cases to deal with before endless workloads
just work. But since the modparam is being used in the wild, I'm not
sure we can wait for userspace to opt in, or wait for cgroups. However,
since users are being encouraged to disable hangcheck, should we extend
the concept of persistence to also mean "survives hangcheck"? -- though
it should be a separate parameter, and I'm not sure exactly how to
protect it from the heartbeat reset without giving gross privileges to
the context. (CPU isolation is nicer from the point of view that we can
just give up and not even worry about the engine. If userspace can
request isolation, it has the privilege to screw up.)
-Chris