[Intel-xe] [PATCH v2 2/2] drm/xe/guc_submit: fixup deregister in job timeout

Matthew Auld <matthew.auld@intel.com>
Fri Aug 4 15:03:09 UTC 2023


On 04/08/2023 14:37, Matthew Brost wrote:
> On Fri, Aug 04, 2023 at 09:48:30AM +0100, Matthew Auld wrote:
>> On 03/08/2023 19:32, Matthew Brost wrote:
>>> On Thu, Aug 03, 2023 at 06:38:51PM +0100, Matthew Auld wrote:
>>>> Rather, check if the engine is still registered before proceeding with
>>>> the deregister steps. Also, the engine being marked as disabled doesn't
>>>> mean it has actually been disabled or deregistered from the GuC's point
>>>> of view, and since we are signalling fences here we need to be sure the
>>>> GuC is no longer using this context.
>>>>
>>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_guc_submit.c | 8 +++++---
>>>>    1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> index b88bfe7d8470..e499e6540ca5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> @@ -881,15 +881,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>>    	}
>>>>    	/* Engine state now stable, disable scheduling if needed */
>>>> -	if (exec_queue_enabled(q)) {
>>>> +	if (exec_queue_registered(q)) {
>>>>    		struct xe_guc *guc = exec_queue_to_guc(q);
>>>>    		int ret;
>>>>    		if (exec_queue_reset(q))
>>>>    			err = -EIO;
>>>>    		set_exec_queue_banned(q);
>>>> -		xe_exec_queue_get(q);
>>>> -		disable_scheduling_deregister(guc, q);
>>>> +		if (!exec_queue_destroyed(q)) {
>>>> +			xe_exec_queue_get(q);
>>>> +			disable_scheduling_deregister(guc, q);
>>>
>>> You could include the wait under this if statement too, but either way works.
>>
>> Do you mean moving the pending_disable wait under the if? My worry is that
> 
> Yea.
> 
>> multiple queued timeout jobs could somehow trigger one after the other, with
>> the first disable_scheduling_deregister() going bad, triggering a timeout on
>> the wait and queuing a GT reset. The GT reset looks to use the same ordered
>> wq as the timeout jobs, so another timeout job might have been queued before
>> the reset job (say, during the ~5 second wait). If that happens, the second
>> timeout job would see that exec_queue_destroyed has already been set,
>> incorrectly skip the wait for the pending_disable state change, and then
>> start signalling fences even though the GuC might still be using the
>> context. Do you know if that is possible?
> 
> Typically, once a GT reset is issued the pending disable state change
> isn't going to happen as the GuC is dead; rather guc_read_stopped() is
> true, which indicates a GT reset is pending in the ordered WQ and it is
> safe to immediately clean up any jobs that have timed out. If multiple
> timeouts occur before processing the GT reset (all of these are on the
> same queue so they have mutual exclusion on execution), that is fine.
> The only way the first timeout can make progress is if the GuC responds
> and does the correct thing, or a GT reset is queued.

Ohh, I missed the xe_uc_reset_prepare() in xe_gt_reset_async. But if 
multiple timeout jobs are injected before the queued GT reset, I don't 
see why it is safe to start signalling fences here. We don't know the 
current state of the hw/guc, so if something goes wrong with the 
deregistration here I would have thought the only safe point is when the 
engine is no longer registered/enabled from the GuC's point of view, 
which should be taken care of when doing the actual GT reset, i.e. after 
flushing the CTB stuff and calling into guc_exec_queue_stop(). Say we 
were to just drop the guc_read_stopped() check for this case?
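
To make the race concrete, the flow I'm worried about is roughly the 
below, hand-trimmed from guc_exec_queue_timedout_job() with this patch 
applied (from memory, so treat the waitqueue and error handling details 
as approximate):

	if (exec_queue_registered(q)) {
		struct xe_guc *guc = exec_queue_to_guc(q);
		int ret;

		if (exec_queue_reset(q))
			err = -EIO;
		set_exec_queue_banned(q);
		if (!exec_queue_destroyed(q)) {
			xe_exec_queue_get(q);
			disable_scheduling_deregister(guc, q);
		}

		/*
		 * Must wait for scheduling to be disabled before
		 * signalling fences. But once the first timed out wait
		 * has queued a GT reset, guc_read_stopped() is true, so
		 * a second timeout job running before the reset job
		 * falls straight through here even though the GuC may
		 * still be using the context.
		 */
		ret = wait_event_timeout(guc->ct.wq,
					 !exec_queue_pending_disable(q) ||
					 guc_read_stopped(guc), HZ * 5);
		if (!ret)
			xe_gt_reset_async(q->gt);
	}

i.e. nothing between falling through that wait and signalling the fences 
confirms the GuC is actually done with the context; only the GT reset 
itself (guc_exec_queue_stop() after the CTB flush) gives that guarantee.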

> 
> Matt
> 
>>
>>>
>>> With that:
>>> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>>> +		}
>>>>    		/*
>>>>    		 * Must wait for scheduling to be disabled before signalling
>>>> -- 
>>>> 2.41.0
>>>>
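
P.S. Just to be sure we mean the same restructure: moving the wait under 
the if, as you suggest, would look something like the below (untested 
sketch):

	if (!exec_queue_destroyed(q)) {
		xe_exec_queue_get(q);
		disable_scheduling_deregister(guc, q);

		/*
		 * Must wait for scheduling to be disabled before
		 * signalling of fences.
		 */
		ret = wait_event_timeout(guc->ct.wq,
					 !exec_queue_pending_disable(q) ||
					 guc_read_stopped(guc), HZ * 5);
		if (!ret)
			xe_gt_reset_async(q->gt);
	}

which reads fine to me, except that the destroyed-already-seen leg then 
skips the wait entirely, which is the same worry as above.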

