[Intel-xe] [PATCH v2 2/2] drm/xe/guc_submit: fixup deregister in job timeout

Matthew Brost matthew.brost at intel.com
Fri Aug 4 13:37:46 UTC 2023


On Fri, Aug 04, 2023 at 09:48:30AM +0100, Matthew Auld wrote:
> On 03/08/2023 19:32, Matthew Brost wrote:
> > On Thu, Aug 03, 2023 at 06:38:51PM +0100, Matthew Auld wrote:
> > > Rather, check if the engine is still registered before proceeding with
> > > the deregister steps. Also, the engine being marked as disabled doesn't
> > > mean it has actually been disabled or deregistered from the GuC's pov,
> > > and since we are signalling fences here we need to be sure the GuC is
> > > not still using this context.
> > > 
> > > Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_guc_submit.c | 8 +++++---
> > >   1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index b88bfe7d8470..e499e6540ca5 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -881,15 +881,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > >   	}
> > >   	/* Engine state now stable, disable scheduling if needed */
> > > -	if (exec_queue_enabled(q)) {
> > > +	if (exec_queue_registered(q)) {
> > >   		struct xe_guc *guc = exec_queue_to_guc(q);
> > >   		int ret;
> > >   		if (exec_queue_reset(q))
> > >   			err = -EIO;
> > >   		set_exec_queue_banned(q);
> > > -		xe_exec_queue_get(q);
> > > -		disable_scheduling_deregister(guc, q);
> > > +		if (!exec_queue_destroyed(q)) {
> > > +			xe_exec_queue_get(q);
> > > +			disable_scheduling_deregister(guc, q);
> > 
> > You could include the wait under this if statement too, but either way works.
> 
> Do you mean move the pending_disable wait under the if? My worry is that

Yea.
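
Something like this (rough, untested sketch on top of your patch; the
wait details are paraphrased from memory and the exact surrounding code
may differ):

	if (!exec_queue_destroyed(q)) {
		xe_exec_queue_get(q);
		disable_scheduling_deregister(guc, q);

		/*
		 * Only wait if we actually issued the disable /
		 * deregister above. Either the GuC responds and clears
		 * pending_disable, or guc_read_stopped() flips true
		 * because a GT reset has been queued.
		 */
		ret = wait_event_timeout(guc->ct.wq,
					 !exec_queue_pending_disable(q) ||
					 guc_read_stopped(guc), HZ * 5);
	}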

> multiple queued timeout jobs could somehow trigger one after the other, and
> the first disable_scheduling_deregister() goes bad, triggering a timeout for
> the wait and queuing a GT reset. The GT reset looks to use the same ordered
> wq as the timeout jobs, so another timeout job might have been queued
> before the reset job (like during the ~5 second wait). If that happens
> the second timeout job would see that exec_queue_destroyed is already set,
> incorrectly skip waiting for the pending_disable state change, and then
> start signalling fences even though the GuC might still be using the
> context. Do you know if that is possible?

Typically once a GT reset is issued the pending disable state change isn't
going to happen as the GuC is dead; rather guc_read_stopped() is true,
which indicates a GT reset is pending in the ordered WQ and it is safe to
immediately clean up any jobs that have timed out. If multiple timeouts
occur before processing the GT reset (all of these are on the same queue
so have mutual exclusion on execution) that is fine. The only way the
first timeout can make progress is if the GuC responds and does the
correct thing, or a GT reset is queued.

Matt 

> 
> > 
> > With that:
> > Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> > 
> > > +		}
> > >   		/*
> > >   		 * Must wait for scheduling to be disabled before signalling
> > > -- 
> > > 2.41.0
> > > 

