[Intel-gfx] [PATCH 1/1] drm/i915: Check if engine has heartbeat when closing a context
Matthew Brost
matthew.brost at intel.com
Fri Jul 30 18:13:38 UTC 2021
On Fri, Jul 30, 2021 at 10:49:01AM +0100, Tvrtko Ursulin wrote:
>
> On 30/07/2021 01:13, John Harrison wrote:
> > On 7/28/2021 17:34, Matthew Brost wrote:
> > > If an engine associated with a context does not have a heartbeat, ban it
> > > immediately. This is needed for GuC submission, as an idle pulse doesn't
> > > kick the context off the hardware, where the backend could then check
> > > for a heartbeat and ban the context.
>
> Pulse, that is a request with I915_PRIORITY_BARRIER, does not preempt a
> running normal priority context?
>
Yes, in both execlists and GuC submission the context gets preempted.
With execlists the i915 sees the preemption CSB, while with GuC submission
the GuC sees it.
> Why does it matter then whether or not heartbeats are enabled - when
> heartbeat just ends up sending the same engine pulse (eventually, with
> escalating priority)?
>
With execlists, when the request gets resubmitted there is a check whether
the context is closed and the heartbeat is disabled. If both are true, the
context gets banned. See __execlists_schedule_in.
With the GuC, since it owns the CSB / resubmission, that heartbeat /
closed check never runs, so the context never gets banned.
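For illustration, the execlists-side check described above can be sketched in plain C with simplified stand-in types (these are not the real i915 structures, just a model of the decision __execlists_schedule_in makes on resubmission):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only -- simplified stand-in types, not the real
 * i915 structures. It models the resubmission-time check: a closed
 * context whose engine has heartbeats disabled gets banned, because no
 * heartbeat pulse will ever come along to do it later.
 */
struct sketch_engine {
	unsigned long heartbeat_interval_ms;	/* 0 == heartbeat disabled */
};

struct sketch_context {
	bool closed;
	struct sketch_engine *engine;
};

/* Mirrors the execlists resubmission check described above (simplified). */
static bool sketch_should_ban_on_resubmit(const struct sketch_context *ce)
{
	return ce->closed && !ce->engine->heartbeat_interval_ms;
}
```

With GuC submission the i915 never sees this resubmission point, which is why the check has to move somewhere the driver still controls.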
> > It's worse than this. If the engine in question is an individual
> > physical engine then sending a pulse (with sufficiently high priority)
> > will pre-empt the engine and kick the context off. However, the GuC
>
> Why is it different for physical vs virtual, aren't both just schedulable
> contexts with different engine masks as far as GuC is concerned? Oh, is it a
> matter of needing to send pulses to all engines which comprise a virtual
> one?
Yes. The whole idle pulse thing is kinda junk. It really assumes that
the backend is execlists. We likely have a bit more work to do here.
>
> > scheduler does not have hacks in it to check the state of the heartbeat
> > or whether a context is actually a zombie or not. Thus, the context will
> > get resubmitted to the hardware after the pulse completes and
> > effectively nothing will have happened.
> >
> > I would assume that the DRM scheduler which we are meant to be switching
> > to for execlists as well as GuC submission is also unlikely to have hacks
> > for zombie contexts and tests for whether the i915 specific heartbeat
> > has been disabled since the context became a zombie. So when that switch
> > happens, this test will also fail in execlists mode as well as GuC mode.
> >
> > The choices I see here are to simply remove persistence completely (it
> > is basically a bug that became UAPI because it wasn't caught soon
> > enough!) or to implement it in a way that does not require hacks in the
> > back end scheduler. Apparently, the DRM scheduler is expected to allow
> > zombie contexts to persist until the DRM file handle is closed. So
> > presumably we will have to go with option two.
> >
> > That means flagging a context as being a zombie when it is closed but
> > still active. The driver would then add it to a zombie list owned by the
> > DRM client object. When that client object is closed, i915 would go
> > through the list and genuinely kill all the contexts. No back end
> > scheduler hacks required and no intimate knowledge of the i915 heartbeat
> > mechanism required either.
> >
> > John.
> >
> >
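For what it's worth, the zombie-list scheme John describes can be sketched roughly like this, with made-up names and simplified types (not the real i915/DRM structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the proposed scheme, with hypothetical names: a context
 * closed while still active is flagged as a zombie and parked on a list
 * owned by the client; when the client (DRM file handle) is closed,
 * everything on the list is genuinely killed. No back end scheduler
 * hacks and no knowledge of the heartbeat mechanism required.
 */
struct zombie_ctx {
	bool active;
	bool killed;
	struct zombie_ctx *next;
};

struct client {
	struct zombie_ctx *zombies;	/* singly linked zombie list */
};

/* Called when user space closes a context. */
static void ctx_close(struct client *c, struct zombie_ctx *ctx)
{
	if (!ctx->active) {
		ctx->killed = true;	/* idle: kill immediately */
		return;
	}
	/* Still running: keep it as a zombie until the client goes away. */
	ctx->next = c->zombies;
	c->zombies = ctx;
}

/* Called when the DRM file handle is closed. */
static void client_close(struct client *c)
{
	struct zombie_ctx *ctx;

	for (ctx = c->zombies; ctx; ctx = ctx->next)
		ctx->killed = true;
	c->zombies = NULL;
}
```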
> > >
> > > This patch also updates intel_engine_has_heartbeat to be a vfunc as we
> > > now need to call this function on execlists virtual engines too.
> > >
> > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > ---
> > > drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++--
> > > drivers/gpu/drm/i915/gt/intel_context_types.h | 2 ++
> > > drivers/gpu/drm/i915/gt/intel_engine.h | 21 ++-----------------
> > > .../drm/i915/gt/intel_execlists_submission.c | 14 +++++++++++++
> > > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 +++++-
> > > .../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 --
> > > 6 files changed, 26 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > index 9c3672bac0e2..b8e01c5ba9e5 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > > @@ -1090,8 +1090,9 @@ static void kill_engines(struct i915_gem_engines *engines, bool ban)
> > > */
> > > for_each_gem_engine(ce, engines, it) {
> > > struct intel_engine_cs *engine;
> > > + bool local_ban = ban || !intel_engine_has_heartbeat(ce->engine);
>
> In any case (pending me understanding what's really going on there), why
> would this check not be in kill_context which currently does this:
>
> bool ban = (!i915_gem_context_is_persistent(ctx) ||
> !ctx->i915->params.enable_hangcheck);
> ...
This is a gem_context level check, while the other check is per
intel_context. We don't have the intel_context here.
> kill_engines(pos, ban);
>
> So whether to ban decision would be consolidated to one place.
>
> In fact, decision on whether to allow persistence is tied to
> enable_hangcheck, which also drives heartbeat emission. So perhaps one part
> of the correct fix is to extend the above (kill_context) ban criteria to
> include heartbeat values anyway. Otherwise isn't it a simple miss that this
> check fails to account for heartbeat disablement via sysfs?
>
Execlists has that check in the resubmission path, which doesn't
exist for the GuC (explained above). This patch just moves the check to
a place where it works with GuC submission.
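To illustrate, the per-context decision the patch adds to kill_engines() boils down to something like this (simplified stand-in types, not the real code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch (simplified stand-in types) of the per-context decision in the
 * patch: even when the gem_context level "ban" is false, a context whose
 * engine has no heartbeat is banned on close, since no pulse will ever
 * evict and ban it later.
 */
struct toy_engine {
	unsigned long heartbeat_interval_ms;	/* 0 == heartbeat disabled */
};

/* Corresponds to the local_ban computation in kill_engines(). */
static bool toy_local_ban(bool ban, const struct toy_engine *engine)
{
	return ban || !engine->heartbeat_interval_ms;
}
```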
Matt
> Regards,
>
> Tvrtko
>
> > > - if (ban && intel_context_ban(ce, NULL))
> > > + if (local_ban && intel_context_ban(ce, NULL))
> > > continue;
> > > /*
> > > @@ -1104,7 +1105,7 @@ static void kill_engines(struct i915_gem_engines *engines, bool ban)
> > > engine = active_engine(ce);
> > > /* First attempt to gracefully cancel the context */
> > > - if (engine && !__cancel_engine(engine) && ban)
> > > + if (engine && !__cancel_engine(engine) && local_ban)
> > > /*
> > > * If we are unable to send a preemptive pulse to bump
> > > * the context from the GPU, we have to resort to a full
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > index e54351a170e2..65f2eb2a78e4 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > @@ -55,6 +55,8 @@ struct intel_context_ops {
> > > void (*reset)(struct intel_context *ce);
> > > void (*destroy)(struct kref *kref);
> > > + bool (*has_heartbeat)(const struct intel_engine_cs *engine);
> > > +
> > > /* virtual engine/context interface */
> > > struct intel_context *(*create_virtual)(struct intel_engine_cs **engine,
> > > unsigned int count);
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> > > index c2a5640ae055..1b11a808acc4 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> > > @@ -283,28 +283,11 @@ struct intel_context *
> > > intel_engine_create_virtual(struct intel_engine_cs **siblings,
> > > unsigned int count);
> > > -static inline bool
> > > -intel_virtual_engine_has_heartbeat(const struct intel_engine_cs *engine)
> > > -{
> > > - /*
> > > - * For non-GuC submission we expect the back-end to look at the
> > > - * heartbeat status of the actual physical engine that the work
> > > - * has been (or is being) scheduled on, so we should only reach
> > > - * here with GuC submission enabled.
> > > - */
> > > - GEM_BUG_ON(!intel_engine_uses_guc(engine));
> > > -
> > > - return intel_guc_virtual_engine_has_heartbeat(engine);
> > > -}
> > > -
> > > static inline bool
> > > intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> > > {
> > > - if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> > > - return false;
> > > -
> > > - if (intel_engine_is_virtual(engine))
> > > - return intel_virtual_engine_has_heartbeat(engine);
> > > + if (engine->cops->has_heartbeat)
> > > + return engine->cops->has_heartbeat(engine);
> > > else
> > > return READ_ONCE(engine->props.heartbeat_interval_ms);
> > > }
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index de5f9c86b9a4..18005b5546b6 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -3619,6 +3619,18 @@ virtual_get_sibling(struct intel_engine_cs *engine, unsigned int sibling)
> > > return ve->siblings[sibling];
> > > }
> > > +static bool virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
> > > +{
> > > + struct intel_engine_cs *engine;
> > > + intel_engine_mask_t tmp, mask = ve->mask;
> > > +
> > > + for_each_engine_masked(engine, ve->gt, mask, tmp)
> > > + if (READ_ONCE(engine->props.heartbeat_interval_ms))
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +
> > > static const struct intel_context_ops virtual_context_ops = {
> > > .flags = COPS_HAS_INFLIGHT,
> > > @@ -3634,6 +3646,8 @@ static const struct intel_context_ops virtual_context_ops = {
> > > .enter = virtual_context_enter,
> > > .exit = virtual_context_exit,
> > > + .has_heartbeat = virtual_engine_has_heartbeat,
> > > +
> > > .destroy = virtual_context_destroy,
> > > .get_sibling = virtual_get_sibling,
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index 89ff0e4b4bc7..ae70bff3605f 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -2168,6 +2168,8 @@ static int guc_virtual_context_alloc(struct intel_context *ce)
> > > return lrc_alloc(ce, engine);
> > > }
> > > +static bool guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
> > > +
> > > static const struct intel_context_ops virtual_guc_context_ops = {
> > > .alloc = guc_virtual_context_alloc,
> > > @@ -2183,6 +2185,8 @@ static const struct intel_context_ops virtual_guc_context_ops = {
> > > .enter = guc_virtual_context_enter,
> > > .exit = guc_virtual_context_exit,
> > > + .has_heartbeat = guc_virtual_engine_has_heartbeat,
> > > +
> > > .sched_disable = guc_context_sched_disable,
> > > .destroy = guc_context_destroy,
> > > @@ -3029,7 +3033,7 @@ guc_create_virtual(struct intel_engine_cs **siblings, unsigned int count)
> > > return ERR_PTR(err);
> > > }
> > > -bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
> > > +static bool guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve)
> > > {
> > > struct intel_engine_cs *engine;
> > > intel_engine_mask_t tmp, mask = ve->mask;
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > index c7ef44fa0c36..c2afc3b88fd8 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > @@ -29,8 +29,6 @@ void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > > struct i915_request *hung_rq,
> > > struct drm_printer *m);
> > > -bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
> > > -
> > > int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> > > atomic_t *wait_var,
> > > bool interruptible,
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gfx