[Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Wed Sep 21 07:59:49 UTC 2022
On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
>
>
alan: [snip]
> > > > > + /*
> > > > > + * If the context gets closed while the execbuf is ongoing,
> > > > > the context
> > > > > + * close code will race with the below code to cancel the
> > > > > delayed work.
> > > > > + * If the context close wins the race and cancels the work, it
> > > > > will
> > > > > + * immediately call the sched disable (see guc_context_close),
> > > > > so there
> > > > > + * is a chance we can get past this check while the
> > > > > sched_disable code
> > > > > + * is being executed. To make sure that code completes before
> > > > > we check
> > > > > + * the status further down, we wait for the close process to
> > > > > complete.
> > > > > + */
> > > > > + if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > + intel_context_sched_disable_unpin(ce);
> > > > > + else if (intel_context_is_closed(ce))
> > > > > + wait_for(context_close_done(ce), 1);
> > > >
> > > > Comment makes it sounds important to handle the race, althought it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > >
> > > will add the consequence (i believe the consequence is that we could
> > > prematurely
> > > read context flags bit indicating its gucid is still registered and
> > > after skipping
> > > re-registration, find that context gets closed and guc-id freed).
> > >
> > > Yes the 1 second is arbitrary and unnervingly short. Just spent
> > > sometime trying to
> >
> > One millisecond even. :)
>
> Normally 1ms is not a slow time for this. We can only hit the wait if
> the context_close call has already called the cancel_delayed_work, so
> the only thing left to do in that function is to send the H2G, which is
> relatively fast. However, I've just realized that if the H2G buffer is
> full the H2G will stall, so in that case it can take longer (maximum
> stall time before a hang is declared is 1.5s).
>
alan: I'll increase to 1.5 secs
> > What would be the consequence of prematurely reading the still
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
> > Something else?
>
> i915 thinks that a request pending with the GuC, while GuC thinks we
> disabled it (because the completion of the delayed_disable happens after
> the new request has been submitted). The heartbeat will still go
> through, so no reset will be triggered, the request is just lost. A new
> request submission on the same context should be able to recover it, but
> we didn't test that.
>
>
> > And why cant' this race with context close happen at any other point
> > than this particular spot in guc_request_alloc? Immediately after the
> > added checks? After atomic_add_unless?
>
> The race is tied to the canceling of the delayed work. context_close
> only does something if it cancels the delayed work, so if the
> cancel_delayed_work_sync here does the cancelling then context_close is
> a no-op.
>
alan: Then i'll add a warn - especially after a 1.5 sec delay and still waiting
for that close.
More information about the Intel-gfx
mailing list