[Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Sep 20 14:33:09 UTC 2022
On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
>
>
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
[snip]
>>>>
>>>>> +	/*
>>>>> +	 * If the context gets closed while the execbuf is ongoing, the
>>>>> +	 * context close code will race with the below code to cancel the
>>>>> +	 * delayed work. If the context close wins the race and cancels
>>>>> +	 * the work, it will immediately call the sched disable (see
>>>>> +	 * guc_context_close), so there is a chance we can get past this
>>>>> +	 * check while the sched_disable code is being executed. To make
>>>>> +	 * sure that code completes before we check the status further
>>>>> +	 * down, we wait for the close process to complete.
>>>>> +	 */
>>>>> +	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>>>> +		intel_context_sched_disable_unpin(ce);
>>>>> +	else if (intel_context_is_closed(ce))
>>>>> +		wait_for(context_close_done(ce), 1);
>>>>
>>>> Comment makes it sound important to handle the race, although it
>>>> doesn't really explain the consequences. But most importantly, what if
>>>> close doesn't complete in 1ms?
>>>
>>> Will add the consequence (I believe the consequence is that we could
>>> prematurely read the context flags bit indicating its guc_id is still
>>> registered and, after skipping re-registration, find that the context
>>> gets closed and the guc_id freed).
>>>
>>> Yes, the 1 second is arbitrary and unnervingly short. Just spent some
>>> time trying to
>>
>> One millisecond even. :)
>
> Normally 1ms is plenty of time for this. We can only hit the wait if
> the context_close call has already called cancel_delayed_work, so the
> only thing left to do in that function is to send the H2G, which is
> relatively fast. However, I've just realized that if the H2G buffer is
> full the H2G will stall, so in that case it can take longer (the maximum
> stall time before a hang is declared is 1.5s).
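Right - so if the wait is meant to cover a stalled H2G, then purely as an
illustration (rough sketch only, the 1500ms value and the error handling
are made up by me, not taken from the patch) the check would have to look
more like:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay)) {
		intel_context_sched_disable_unpin(ce);
	} else if (intel_context_is_closed(ce)) {
		/*
		 * Worst case guc_context_close() is stuck behind a full H2G
		 * buffer, which you say can stall for up to ~1.5s before a
		 * hang is declared, so the wait would need to cover at least
		 * that (1500 here is illustrative only).
		 */
		if (wait_for(context_close_done(ce), 1500))
			return -ETIMEDOUT; /* or some other fallback - TBD */
	}

Otherwise, with a 1ms wait, we can still fall through before the close has
completed.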
>
>>
>> What would be the consequence of prematurely reading the still
>> registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
>> Something else?
>
> i915 thinks that a request is pending with the GuC, while the GuC thinks
> we disabled it (because the completion of the delayed_disable happens
> after the new request has been submitted). The heartbeat will still go
> through, so no reset will be triggered; the request is just lost. A new
> request submission on the same context should be able to recover it, but
> we didn't test that.
>
>
>>
>> And why can't this race with context close happen at any other point
>> than this particular spot in guc_request_alloc? Immediately after the
>> added checks? After atomic_add_unless?
>
> The race is tied to the canceling of the delayed work. context_close
> only does something if it cancels the delayed work, so if the
> cancel_delayed_work_sync here does the cancelling then context_close is
> a no-op.
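If I am reading that right, the invariant is that exactly one of the two
sides wins the cancel and therefore owns issuing the sched disable -
roughly (sketch only, heavily simplified from the real code):

	/* guc_context_close() side */
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		/* close cancelled the pending work, so close issues the disable */
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);

	/* guc_request_alloc() side */
	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		/* execbuf cancelled it first, so close's cancel fails and is a no-op */
		intel_context_sched_disable_unpin(ce);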
Still, it's a bit involved* to follow, so I'll give up and declare a
"whatever", if that's okay with you.
*)

intel_context_close()
  set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
  -> guc_context_close()
       if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
           cancel_delayed_work(&ce->guc_state.sched_disable_delay))
               __delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
               ((which is:
                   spin_lock_irqsave(&ce->guc_state.lock, flags);
                   ..stuff..
                   unlock
                ))
       spin_lock_irqsave(&ce->guc_state.lock, flags);
       set_context_close_done(ce);
       spin_unlock_irqrestore(&ce->guc_state.lock, flags);
That takes and releases the same lock twice in a row, so I have no idea
why it is done in two steps, let alone whether it is safe and race free.
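Naively I would have expected something along these lines instead (just a
sketch to show what I mean, __guc_context_sched_disable_locked() is a
made-up helper standing in for the body of __delay_sched_disable() minus
the locking):

	unsigned long flags;
	bool cancelled;

	cancelled = test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    cancel_delayed_work(&ce->guc_state.sched_disable_delay);

	spin_lock_irqsave(&ce->guc_state.lock, flags);
	if (cancelled)
		/* hypothetical helper, assumes ce->guc_state.lock is held */
		__guc_context_sched_disable_locked(ce);
	set_context_close_done(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

But maybe there is a reason it has to be split that I am not seeing.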
Regards,
Tvrtko