[Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Wed Sep 21 22:30:29 UTC 2022


On Fri, 2022-09-16 at 09:58 +0100, Tvrtko Ursulin wrote:
> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > From: Matthew Brost <matthew.brost at intel.com>
> > > > 
> > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_context *ce;
> > > > +	unsigned long index;
> > > > +	unsigned long flags;
> > > > +	unsigned long ceflags;
> > > >    
> > > > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > > > -		__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		if (!kref_get_unless_zero(&ce->ref))
> > > > +			continue;
> > > > +		xa_unlock(&guc->context_lookup);
> > > 
> > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > why in a comment would be good.
> > > 
> > Being mid-reset, the locking mode is consistent with other functions also used
> > as part of the reset preparation that parse and potentially modify contexts.
> > I believe the goal is to handle all of this parsing without getting conflicting
> > latent G2H replies that break the preparation flow (which herds active contexts
> > into a smaller set of states as part of reset) - but I will double check
> > with my colleagues.
> > 
Alan: Update: I realize the guc-reset-preparation code disables interrupts for
the GuC being reset prior to this function, so in fact we really ought not to
see any change to that xa_array from a context-id change arriving via an
interrupt that is orthogonal to this thread. I will change to xa_lock.
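
i.e. something along these lines (just a sketch of my intent here, not the
final patch - the per-context handling is elided and shown further below):

static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
	struct intel_context *ce;
	unsigned long index;

	/*
	 * GuC interrupts were already disabled earlier in reset preparation,
	 * so no G2H can mutate context_lookup under us and the plain xa_lock
	 * (no irqsave) is sufficient.
	 */
	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		xa_unlock(&guc->context_lookup);

		/* per-context cancel/flush handling goes here (see below) */

		intel_context_put(ce);
		xa_lock(&guc->context_lookup);
	}
	xa_unlock(&guc->context_lookup);
}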

> > > > +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > > 
> > > This deserves a comment about what lock toggling wants to ensure.
> > > 
> > I realize this might have been my local rebasing mistake; the intention was to
> > handle cases where sched_disable_delay wasn't pending but was potentially still
> > executing do_sched_disable. I believe I could try cancel_delayed_work_sync (but
> > I am not sure I can call that might-sleep function mid-reset while not-
> > interruptible). Else, I would move that lock-toggle to the case where
> > cancel_delayed_work did not return true (as per the original intent before my
> > rebase error).
> 
> Okay a comment like "flush any currently executing do_sched_disable" 
> above the lock toggling would do then. Even better if you add what 
> happens if that flushing isn't done.
> 
> > > Also, if the loop runs with interrupts disabled what is the point of
> > > the irqsave variant in here??
> > Yes - it's redundant, let me fix that, apologies for that.
> > 
Same thing here: a context's GuC state should not change in response to an IRQ
from that GuC, since we disabled its interrupts during reset preparation
- so the irqsave is actually "not needed" as opposed to "redundant".
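
With the rebase error fixed and the irqsave dropped, that per-context piece
would look something like this (again only a sketch, carrying the comment
Tvrtko asked for):

	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
		/*
		 * The delayed work was not pending, so do_sched_disable()
		 * may be executing right now on another CPU; take and
		 * release guc_state.lock to flush it out. Without this we
		 * could start scrubbing outstanding G2H while a
		 * sched-disable is still mid-flight.
		 */
		spin_lock(&ce->guc_state.lock);
		spin_unlock(&ce->guc_state.lock);
	}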

> > > Also2, what is the reason for dropping the lock? intel_context_put?
> > Being consistent with other reset preparation code that closes contexts,
> > the lock is dropped before the intel_context_put.
> > (I hope I am not misunderstanding your question).
> 
> Right, okay.. so for locking inversion issues - intel_context_put must 
> not be called with guc_state.lock held, yes?
> 
> Not a mandatory request, but if you want you could look into the option 
> of avoiding lock dropping and instead doing xa_erase and recording the 
> list of contexts for sched disable or put in a local list, and doing 
> them all in one block outside the locked/irq disabled section. Although 
> tbh I am not sure if that would improve anything. Probably not in this 
> case of a reset path, but if there are other places in GuC code which do 
> this it may be beneficial for less hammering on the CPU and core 
> synchronisation for atomics.
> 
Apologies for my ignorance - perhaps I misunderstood how these functions work,
but I assumed that a non-zero return from the kref_get_unless_zero that got us
here means there is still a ref on the context that didn't come from the
reset path, so I am just following the correct rules to release that ref
without destroying the context yet - leaving it pending-disable, to be
handled in scrub_guc_desc_for_outstanding_g2h when that gets called later
in the reset preparation.
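
On the local-list idea - if I am reading it right, it would be something like
the sketch below. To be clear, this is illustrative only: 'flush_link' is a
hypothetical list node on the context that does not exist today.

	struct intel_context *ce, *cn;
	unsigned long index;
	LIST_HEAD(flush_list);

	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		/* 'flush_link' is hypothetical, for illustration only */
		list_add_tail(&ce->flush_link, &flush_list);
	}
	xa_unlock(&guc->context_lookup);

	/* cancel/flush and drop all the refs in one block, outside the lock */
	list_for_each_entry_safe(ce, cn, &flush_list, flush_link) {
		list_del(&ce->flush_link);
		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}
		intel_context_put(ce);
	}

As you say though, it probably does not buy much on this reset path.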

Actually, I realize the better option might be to move the cancel/flush code
above into the loop already present inside scrub_guc_desc_for_outstanding_g2h,
just prior to its check of whether a context is pending-disable. That would
ensure we take all of these context locks once, in the same function, when
herding all possible states while scrubbing those outstanding G2H.
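
Schematically, that would be (the existing body of
scrub_guc_desc_for_outstanding_g2h is elided here; this only shows where the
new piece would slot in):

static void scrub_guc_desc_for_outstanding_g2h(struct intel_guc *guc)
{
	struct intel_context *ce;
	unsigned long index;

	xa_for_each(&guc->context_lookup, index, ce) {
		/*
		 * Proposed addition: cancel/flush the delayed sched-disable
		 * before the existing pending-disable check, so that all of
		 * these context locks are taken once, in one place.
		 */
		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}

		/*
		 * Existing pending-enable / pending-disable scrubbing
		 * continues here (elided).
		 */
	}
}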



