[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Tue Apr 13 07:07:34 UTC 2021

On 13.04.21 at 07:36, Andrey Grodzovsky wrote:
> [SNIP]

> emit_fence(fence);
>>>>>>>
>>>>>>> /* We can't wait forever as the HW might be gone at any point */
>>>>>>>        dma_fence_wait_timeout(old_fence, 5S);
>>>>>>>
>>>>>>
>>>>>> You can pretty much ignore this wait here. It is only there as a
>>>>>> last resort so that we never overwrite the ring buffers.
>>>>>
>>>>>
>>>>> If the device is present, how can I ignore this?
>>>>>
>>>
>>> I think you missed my question here.
>>>
>>
>> Sorry, I thought I answered that below.
>>
>> See, this is just a last resort so that we don't need to worry about
>> ring buffer overflows during testing.
>>
>> We should not get here in practice, and if we do, generating a
>> deadlock might actually be the best way to handle it.
>>
>> The alternative would be to call BUG().
>
>
> BTW, I am not sure it's so improbable to get here in case of a sudden
> device removal. If you are rapidly submitting commands to the ring
> at that time, you could easily get a ring buffer overrun, because
> EOP interrupts are gone and fences are no longer removed, while new
> ones keep arriving from submissions that have not stopped yet.
>

During normal operation hardware fences are only created by two code paths:
1. The scheduler when it pushes jobs to the hardware.
2. The KIQ when it does register access on SRIOV.

Both are limited in how many submissions they can make.

The only case where this wait becomes necessary is during GPU reset, when
we do direct submissions that bypass the scheduler for IB and other tests.
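
To illustrate the difference between the bounded and the direct path, here 
is a minimal userspace sketch. It is not the driver code: the struct, the 
function names and the slot count are all hypothetical, and the real driver 
would wait on the old fence where this model simply reports that the slot is 
still busy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of fence slots tracked per ring. */
#define NUM_SLOTS 4u

/* Toy model of a ring's fence bookkeeping (names are assumptions). */
struct fence_ring {
    uint32_t emitted;   /* sequence number of the last fence emitted */
    uint32_t signaled;  /* last sequence number completed by the "HW" */
};

/* Stand-in for the EOP interrupt: the hardware completed up to seq. */
static void ring_signal(struct fence_ring *r, uint32_t seq)
{
    if ((int32_t)(seq - r->signaled) > 0)
        r->signaled = seq;
}

/* True if the next fence would reuse a slot whose fence is still pending. */
static bool ring_slot_busy(const struct fence_ring *r)
{
    return r->emitted - r->signaled >= NUM_SLOTS;
}

/*
 * Emit one fence. If the slot is still busy, this model returns false;
 * that is the point where a driver would fall back to waiting on the
 * old fence as a last resort.
 */
static bool ring_emit(struct fence_ring *r, uint32_t *seq)
{
    if (ring_slot_busy(r))
        return false;
    *seq = ++r->emitted;
    return true;
}

/*
 * Bounded submitter, standing in for a scheduler-like path: it never keeps
 * more than `limit` fences in flight, so it holds jobs back on its own
 * and never reaches the busy-slot case as long as limit <= NUM_SLOTS.
 */
static bool bounded_submit(struct fence_ring *r, uint32_t limit)
{
    uint32_t seq;

    assert(limit <= NUM_SLOTS);
    if (r->emitted - r->signaled >= limit)
        return false;
    return ring_emit(r, &seq);
}
```

With a limit below NUM_SLOTS, bounded_submit() refuses further work before 
any slot is reused. Calling ring_emit() directly, as a stand-in for direct 
submission, can fill every slot, and only then does the busy-slot path 
matter.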

Christian.

> Andrey
>
