[Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Tue Sep 20 06:22:48 UTC 2022
On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
>
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
> > On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > > From: Matthew Brost <matthew.brost at intel.com>
> > > > >
> > > > > +static void guc_context_sched_disable(struct intel_context *ce)
> > > > > +{
> > > > > + struct intel_guc *guc = ce_to_guc(ce);
> > > > > + u64 delay = guc->submission_state.sched_disable_delay_ms;
> > > > > + unsigned long flags;
> > > > > +
> > > > > + spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > > > +
> > > > > + if (bypass_sched_disable(guc, ce)) {
> > > > > + spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > + intel_context_sched_disable_unpin(ce);
> > > > > + } else if (!intel_context_is_closed(ce) &&
> > > > > !guc_id_pressure(guc, ce) &&
> > > > > + delay) {
> > > > > spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > - goto unpin;
> > > > > + mod_delayed_work(system_unbound_wq,
> > > > > + &ce->guc_state.sched_disable_delay,
> > > > > + msecs_to_jiffies(delay));
> > > > > + } else {
> > > > > + do_sched_disable(guc, ce, flags);
> > > > > }
> > > >
> > > > lock
> > > > if
> > > > unlock
> > > > do stuff
> > > > else if
> > > > unlock
> > > > do stuff
> > > > else
> > > > do_sched_disable - which unlocks inside
> > > >
> > > > IMO it creates less readable code for the benefit of not repeating
> > > > with_intel_runtime_pm -> __guc_context_sched_disable two times.
> > > > Dunno.. it's ugly but I have no suggestions. Hm, does it have to
> > > > send using the busy loop? Couldn't it just queue the request and
> > > > then wait for the reply if a disable message was emitted?
> > > >
> > > I agree that the above code lacks readability - will see if I can
> > > break it down to smaller functions with cleaner in-function
> > > lock/unlock pairs. I agree that a little code duplication is better
> > > than less readable code. It was inherited code I didn't want to
> > > modify, but I'll go ahead and refactor this.
> > >
> > > On the busy loop - I'm assuming you are referring to the actual ct
> > > sending. I'll consult my team if I am missing anything more, but
> > > based on comments, I believe callers must use that function to
> > > guarantee reservation of space in the G2H CTB so there is always
> > > space to capture responses for actions that MUST be acknowledged by
> > > the GuC (acknowledged by replying with either a success or a
> > > failure). This is necessary for a coherent guc-id state machine
> > > (because the GuC firmware will drop requests for guc-ids that are
> > > not registered or not in a 'sched-enabled' state).
> >
> > Maybe to explain what I meant a bit better. I thought that the reason
> > for strange unlock pattern is the with_runtime_pm which needs to
> > happen for the CT send/receive loop. What I meant was would it be
> > possible to do it like this:
> >
Alan: I assumed the strange lock-unlock was for no other reason than to ensure that
once the delayed task begins executing, we don't let go of the lock until after
we complete the call to prep_context_pending_disable (which modifies the state
machine for the context's guc-id). And as we expect, the runtime_pm is about
trying to send the H2G signal to the GuC, which touches registers (i.e. the lock
was for the ce-gucid state and the power was for guc interaction via ct). If my
assumptions are correct, then I can refactor the functions a bit to remove that
weird unlock flow while keeping the lock held from start until
prep_context_pending_disable; else I could still go ahead and do that (with
some minimal duplication).
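Alan: for clarity, a minimal userspace sketch of the refactor I have in mind - pthread mutexes standing in for ce->guc_state.lock, and placeholder functions standing in for prep_context_pending_disable() and the runtime-PM-protected H2G send. Names mirror the kernel code but this is illustration only, not the actual i915 implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the context's guc-id state. */
struct ctx_state {
	pthread_mutex_t lock;	/* stands in for ce->guc_state.lock */
	bool pending_disable;	/* stands in for the guc-id state bits */
	int guc_id;
};

/* Must run under ce->lock: flips the state machine and returns the
 * guc_id to put in the H2G message. */
static int prep_context_pending_disable(struct ctx_state *ce)
{
	ce->pending_disable = true;
	return ce->guc_id;
}

/* Placeholder for with_intel_runtime_pm() +
 * __guc_context_sched_disable(): slow, talks to the GuC, and must
 * not hold the state lock. */
static void send_sched_disable_h2g(int guc_id)
{
	(void)guc_id;
}

/* The refactored flow: lock held from the start through the state
 * transition, dropped exactly once before the slow send. */
void do_sched_disable(struct ctx_state *ce)
{
	int guc_id;

	pthread_mutex_lock(&ce->lock);
	guc_id = prep_context_pending_disable(ce);
	pthread_mutex_unlock(&ce->lock);

	send_sched_disable_h2g(guc_id);
}
```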
> > state lock
> > ...
> > sent = queue_msg_2_guc (this would be i915 only, no runtime pm needed)
> > ...
> > state unlock
> >
> > if (sent)
> > with_runtime_pm:
> > send/receive queued guc messages (only this would talk to guc)
> >
> > But I have truly no idea if the CT messaging infrastructure supports
> > something like this.
> >
Alan: I'll look around and see if we have something that provides that flow,
but I don't think I should pursue such a redesign as part of this series
if none exists.
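Alan: to make sure I understand the suggestion, here is a userspace sketch of the split you describe - queue the message under the state lock (CPU-only work), then do the actual send outside the lock where runtime PM could be taken. queue_msg_2_guc() and flush_queued_guc_msgs() are hypothetical names from your pseudocode; the real CT layer may not support this split today:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MSG_QUEUE_DEPTH 8

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int msg_queue[MSG_QUEUE_DEPTH];
static int msg_count;

/* i915-only bookkeeping: runs under the state lock, no runtime PM
 * needed because nothing touches the hardware here. */
static bool queue_msg_2_guc(int guc_id)
{
	if (msg_count >= MSG_QUEUE_DEPTH)
		return false;
	msg_queue[msg_count++] = guc_id;
	return true;
}

/* Talks to the "hardware": in the kernel this part would run under
 * with_intel_runtime_pm(), with no state lock held. */
static int flush_queued_guc_msgs(void)
{
	int sent = msg_count;

	msg_count = 0;
	return sent;
}

int sched_disable_split(int guc_id)
{
	bool queued;

	pthread_mutex_lock(&state_lock);
	queued = queue_msg_2_guc(guc_id);	/* state transition + queue */
	pthread_mutex_unlock(&state_lock);

	if (queued)
		return flush_queued_guc_msgs();	/* send/receive outside lock */
	return 0;
}
```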
> > Anyway, see what it is possible and if it is not or too hard for now
> > leave it be.
> >
> > > > > - guc_id = prep_context_pending_disable(ce);
> > > > > +}
> > > > > - spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > > > > +{
> > > > > + struct intel_context *ce;
> > > > > + unsigned long index;
> > > > > + unsigned long flags;
> > > > > + unsigned long ceflags;
> > > > > - with_intel_runtime_pm(runtime_pm, wakeref)
> > > > > - __guc_context_sched_disable(guc, ce, guc_id);
> > > > > + xa_lock_irqsave(&guc->context_lookup, flags);
> > > > > + xa_for_each(&guc->context_lookup, index, ce) {
> > > > > + if (!kref_get_unless_zero(&ce->ref))
> > > > > + continue;
> > > > > + xa_unlock(&guc->context_lookup);
> > > >
> > > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > > why in a comment would be good.
> > > >
> > > Being mid-reset, the locking mode is consistent with other functions
> > > also used as part of the reset preparation that parse and potentially
> > > modify contexts. I believe the goal is to handle all of this parsing
> > > without getting conflicting latent G2H replies that break the
> > > preparation flow (which herds active contexts into a smaller set of
> > > states as part of reset) - but I will double check with my colleagues.
> > >
> > > > > + if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > > + cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > > + spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > > + spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > > >
> > > > This deserves a comment about what lock toggling wants to ensure.
> > > >
> > > I realize this might have been my local rebasing mistake; the
> > > intention was to handle cases where sched_disable_delay wasn't
> > > pending but was potentially still executing do_sched_disable. I
> > > believe I could try cancel_delayed_work_sync (but I'm not sure if I
> > > can call that might-sleep function mid-reset while not
> > > interruptible). Else, I would move that lock-unlock to the case
> > > where cancel_delayed_work did not return true (as per the original
> > > intent before my rebase error).
> >
> > Okay a comment like "flush any currently executing do_sched_disable"
> > above the lock toggling would do then. Even better if you add what
> > happens if that flushing isn't done.
> >
got it. will do.
> > > > Also, if the loops runs with interrupts disabled what is the point of
> > > > irqsave variant in here??
> > > Yes - it's redundant, let me fix that, apologies for that.
> > >
> > > > Also2, what is the reason for dropping the lock? intel_context_put?
> > > Being consistent with other reset preparation code that closes
> > > contexts, the lock is dropped before the intel_context_put.
> > > (I hope I am not misunderstanding your question.)
> >
> > Right, okay.. so for locking inversion issues - intel_context_put must
> > not be called with guc_state.lock held, yes?
> >
> > Not a mandatory request, but if you want you could look into the
> > option of avoiding lock dropping and instead doing xa_erase and
> > recording the list of contexts for sched disable or put in a local
> > list, and doing them all in one block outside the locked/irq disabled
> > section. Although tbh I am not sure if that would improve anything.
> > Probably not in this case of a reset path, but if there are other
> > places in GuC code which do this it may be beneficial for less
> > hammering on the CPU and core synchronisation for atomics.
> >
> > > > > + /*
> > > > > +  * If the context gets closed while the execbuf is ongoing, the
> > > > > +  * context close code will race with the below code to cancel the
> > > > > +  * delayed work. If the context close wins the race and cancels the
> > > > > +  * work, it will immediately call the sched disable (see
> > > > > +  * guc_context_close), so there is a chance we can get past this
> > > > > +  * check while the sched_disable code is being executed. To make
> > > > > +  * sure that code completes before we check the status further
> > > > > +  * down, we wait for the close process to complete.
> > > > > +  */
> > > > > + if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > + intel_context_sched_disable_unpin(ce);
> > > > > + else if (intel_context_is_closed(ce))
> > > > > + wait_for(context_close_done(ce), 1);
> > > >
> > > > Comment makes it sound important to handle the race, although it
> > > > doesn't really explain the consequences. But most importantly, what
> > > > if close doesn't complete in 1ms?
> > >
> > > will add the consequence (I believe the consequence is that we could
> > > prematurely read the context flags bit indicating its guc-id is
> > > still registered and, after skipping re-registration, find that the
> > > context gets closed and the guc-id freed).
> > >
> > > Yes, the 1 second is arbitrary and unnervingly short. Just spent
> > > some time trying to
> >
> > One millisecond even. :)
>
hehe - typo :)
> Normally 1ms is not a slow time for this. We can only hit the wait if
> the context_close call has already called the cancel_delayed_work, so
> the only thing left to do in that function is to send the H2G, which is
> relatively fast. However, I've just realized that if the H2G buffer is
> full the H2G will stall, so in that case it can take longer (maximum
> stall time before a hang is declared is 1.5s).
>
Daniele, so increase to 1500 milliseconds?
> > What would be the consequence of prematurely reading the still
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
> > Something else?
>
> i915 thinks that a request is pending with the GuC, while GuC thinks we
> disabled it (because the completion of the delayed_disable happens after
> the new request has been submitted). The heartbeat will still go
> through, so no reset will be triggered; the request is just lost. A new
> request submission on the same context should be able to recover it, but
> we didn't test that.
>
>
> > And why cant' this race with context close happen at any other point
> > than this particular spot in guc_request_alloc? Immediately after the
> > added checks? After atomic_add_unless?
>
> The race is tied to the canceling of the delayed work. context_close
> only does something if it cancels the delayed work, so if the
> cancel_delayed_work_sync here does the cancelling then context_close is
> a no-op.
>
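Alan: that makes sense - the single-winner property falls out of cancel_delayed_work itself. Sketching it as an atomic exchange (a userspace model only, not how the kernel implements it) makes it explicit that exactly one of the two racing paths can own the disable:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models the "pending" bit of the delayed work. Both
 * guc_context_close() and guc_request_alloc() try to cancel the work,
 * and only the caller whose cancel succeeds goes on to issue the
 * sched disable; the other sees a no-op. */
static atomic_bool sched_disable_work_pending;

/* Returns true only for the one caller that actually cancelled. */
static bool cancel_delayed_work_model(void)
{
	return atomic_exchange(&sched_disable_work_pending, false);
}

int race_to_cancel(void)
{
	int winners = 0;

	atomic_store(&sched_disable_work_pending, true);

	/* two racing paths, in either order */
	if (cancel_delayed_work_model())	/* e.g. guc_context_close() */
		winners++;
	if (cancel_delayed_work_model())	/* e.g. guc_request_alloc() */
		winners++;

	return winners;		/* exactly one path owns the disable */
}
```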
> > > figure out portions of the SCHED_foo state machine bits and believe
> > > that it's possible for guc_request_alloc to just force context_close
> > > to be done from here, as it would force it into a state requiring
> > > re-registration and would close that a few lines below. I will
> > > however verify with my team mates as I am new to these SCHED_foo
> > > state machine bits.
>
> I'm not sure we want to force context_close unconditionally here, that'd
> be a big overhead. Doing it only if there is a close pending is also
> potentially an issue, the whole point is that the close can race in. The
> close also has to work on its own, because in the normal use-cases we
> don't get a context_close while a request submission is ongoing.
> Unless you meant something different and I completely misunderstood.
>
> Daniele
>
> > Yes it always looked complicated and perhaps it has even grown more so
> > - I'm afraid I cannot be of any help there.
> >
> > Regards,
> >
> > Tvrtko