[Intel-xe] [PATCH v2 2/2] drm/xe/guc_submit: fixup deregister in job timeout
Matthew Auld
matthew.auld at intel.com
Fri Aug 4 15:03:09 UTC 2023
On 04/08/2023 14:37, Matthew Brost wrote:
> On Fri, Aug 04, 2023 at 09:48:30AM +0100, Matthew Auld wrote:
>> On 03/08/2023 19:32, Matthew Brost wrote:
>>> On Thu, Aug 03, 2023 at 06:38:51PM +0100, Matthew Auld wrote:
>>>> Rather, check if the engine is still registered before proceeding
>>>> with the deregister steps. Also, the engine being marked as disabled
>>>> doesn't mean it has actually been disabled or deregistered from the
>>>> GuC's point of view, and since we are signalling fences here we need
>>>> to be sure the GuC is no longer using this context.
>>>>
>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_guc_submit.c | 8 +++++---
>>>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> index b88bfe7d8470..e499e6540ca5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> @@ -881,15 +881,17 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>> }
>>>> /* Engine state now stable, disable scheduling if needed */
>>>> - if (exec_queue_enabled(q)) {
>>>> + if (exec_queue_registered(q)) {
>>>> struct xe_guc *guc = exec_queue_to_guc(q);
>>>> int ret;
>>>> if (exec_queue_reset(q))
>>>> err = -EIO;
>>>> set_exec_queue_banned(q);
>>>> - xe_exec_queue_get(q);
>>>> - disable_scheduling_deregister(guc, q);
>>>> + if (!exec_queue_destroyed(q)) {
>>>> + xe_exec_queue_get(q);
>>>> + disable_scheduling_deregister(guc, q);
>>>
>>> You could include the wait under this if statement too (roughly as
>>> sketched at the end of this mail) but either way works.
>>
>> Do you mean move the pending_disable wait under the if? My worry is that
>
> Yea.
>
>> multiple queued timeout jobs could somehow trigger one after the
>> other, with the first disable_scheduling_deregister() going bad,
>> triggering a timeout for the wait and queuing a GT reset. The GT reset
>> looks to use the same ordered wq as the timeout jobs, so another
>> timeout job might be queued before the reset job (like when doing the
>> ~5 second wait). If that happens the second timeout job would see that
>> exec_queue_destroyed has been set, incorrectly skip waiting for the
>> pending_disable state change, and then start signalling fences even
>> though the GuC might still be using the context. Do you know if that
>> is possible?
>
> Typically once a GT reset is issued the pending disable state change
> isn't going to happen as the GuC is dead; rather guc_read_stopped() is
> true, which indicates a GT reset is pending in the ordered WQ and it is
> safe to immediately clean up any jobs that have timed out. If multiple
> timeouts occur before the GT reset is processed (all of these are on
> the same queue so have mutual exclusion on execution) that is fine. The
> only way the first timeout can make progress is if the GuC responds and
> does the correct thing, or a GT reset is queued.
Ohh, I missed the xe_uc_reset_prepare() in xe_gt_reset_async(). But if
multiple timeout jobs are injected before the queued GT reset, I don't
see why it is safe to start signalling fences here. We don't know the
current state of the hw/GuC, so if something goes wrong with
deregistration here I would have thought the only safe point is when the
engine is no longer registered/enabled from the GuC pov, which should be
taken care of when doing the actual GT reset, so after flushing the CTB
stuff and calling into guc_exec_queue_stop(). Say, what if we were to
just drop the read_stopped() check from the wait for this case? Roughly
like the sketch below.
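
Just a sketch of what I mean, assuming the existing wait_event_timeout()
in guc_exec_queue_timedout_job() (exact names/details may differ):

	/*
	 * Sketch only: without the guc_read_stopped() check the wait
	 * can no longer be short-circuited by a pending GT reset, so
	 * we only proceed to signal fences once the disable has
	 * actually completed, or we time out and queue a reset.
	 */
	ret = wait_event_timeout(guc->ct.wq,
				 !exec_queue_pending_disable(q), HZ * 5);
	if (!ret)
		xe_gt_reset_async(q->gt);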
>
> Matt
>
>>
>>>
>>> With that:
>>> Reviewed-by: Matthew Brost <matthew.brost at intel.com>
>>>
>>>> + }
>>>> /*
>>>> * Must wait for scheduling to be disabled before signalling
>>>> --
>>>> 2.41.0
>>>>
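
For reference, a rough sketch of the variant Matt suggests above, with
the wait moved under the !exec_queue_destroyed() check. Names are taken
from the patch context above; treat this as illustrative only, not the
actual implementation:

	if (exec_queue_registered(q)) {
		struct xe_guc *guc = exec_queue_to_guc(q);
		int ret;

		if (exec_queue_reset(q))
			err = -EIO;
		set_exec_queue_banned(q);
		if (!exec_queue_destroyed(q)) {
			xe_exec_queue_get(q);
			disable_scheduling_deregister(guc, q);

			/*
			 * Wait lives under the if: only wait for the
			 * pending disable when we actually issued one
			 * here, otherwise skip straight to signalling.
			 */
			ret = wait_event_timeout(guc->ct.wq,
						 !exec_queue_pending_disable(q) ||
						 guc_read_stopped(guc), HZ * 5);
			if (!ret)
				xe_gt_reset_async(q->gt);
		}
	}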