[Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Sep 20 14:33:09 UTC 2022



On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
> 
> 
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:

[snip]

>>>>
>>>>> +    /*
>>>>> +     * If the context gets closed while the execbuf is ongoing, the
>>>>> +     * context close code will race with the below code to cancel the
>>>>> +     * delayed work. If the context close wins the race and cancels the
>>>>> +     * work, it will immediately call the sched disable (see
>>>>> +     * guc_context_close), so there is a chance we can get past this
>>>>> +     * check while the sched_disable code is being executed. To make
>>>>> +     * sure that code completes before we check the status further
>>>>> +     * down, we wait for the close process to complete.
>>>>> +     */
>>>>> +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>>>> +        intel_context_sched_disable_unpin(ce);
>>>>> +    else if (intel_context_is_closed(ce))
>>>>> +        wait_for(context_close_done(ce), 1);
>>>>
>>>> The comment makes it sound important to handle the race, although it
>>>> doesn't really explain the consequences. But most importantly, what if
>>>> close doesn't complete in 1ms?
>>>
>>> Will add the consequence (I believe the consequence is that we could
>>> prematurely read the context flags bit indicating its guc_id is still
>>> registered and, after skipping re-registration, find that the context
>>> gets closed and the guc_id freed).
>>>
>>> Yes the 1 second is arbitrary and unnervingly short. Just spent
>>> some time trying to
>>
>> One millisecond even. :)
> 
> Normally 1ms is plenty of time for this. We can only hit the wait if
> the context_close call has already called the cancel_delayed_work, so
> the only thing left to do in that function is to send the H2G, which is
> relatively fast. However, I've just realized that if the H2G buffer is
> full the H2G will stall, so in that case it can take longer (the maximum
> stall time before a hang is declared is 1.5s).
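
Right - so if the wait here were meant to cover that worst case, it would
look something like the below. Just a sketch using the names from the hunk
above; the 1500ms figure is only the stall budget you mention, not a tested
value:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		intel_context_sched_disable_unpin(ce);
	else if (intel_context_is_closed(ce))
		/* close path may be stalled behind a full H2G buffer */
		wait_for(context_close_done(ce), 1500);
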
> 
>>
>> What would be the consequence of prematurely reading the still 
>> registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
>> Something else?
> 
> i915 thinks that a request is pending with the GuC, while the GuC thinks
> we disabled it (because the completion of the delayed_disable happens
> after the new request has been submitted). The heartbeat will still go
> through, so no reset will be triggered; the request is just lost. A new
> request submission on the same context should be able to recover it, but
> we didn't test that.
> 
> 
>>
>> And why can't this race with context close happen at any other point
>> than this particular spot in guc_request_alloc? Immediately after the
>> added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.

It's a bit involved* to follow, so I'll give up and declare a "whatever", if that's okay with you.

*)
intel_context_close
   set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
   ->guc_context_close
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
		  ((which is:
			spin_lock_irqsave(&ce->guc_state.lock, flags);
			..stuff..
			unlock
		  ))
  
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	set_context_close_done(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

It takes and releases the same lock twice; I have no idea why it is done in two goes, and even less whether it is safe and race free.
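
For reference, this is roughly the shape I am reading out of the flow above,
with the two critical sections spelled out (my reconstruction, not the actual
function body):

	static void guc_context_close(struct intel_context *ce)
	{
		unsigned long flags;

		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
			/* first lock/unlock happens inside here */
			__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);

		/* second lock/unlock, only to mark the close as done */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		set_context_close_done(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
	}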

Regards,

Tvrtko

