[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König christian.koenig at amd.com
Mon Apr 12 18:05:16 UTC 2021


On 12.04.21 20:01, Andrey Grodzovsky wrote:
>
> On 2021-04-12 1:44 p.m., Christian König wrote:
>
>>
>> On 12.04.21 19:27, Andrey Grodzovsky wrote:
>>> On 2021-04-10 1:34 p.m., Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> On 09.04.21 20:18, Andrey Grodzovsky wrote:
>>>>> [SNIP]
>>>>>>>
>>>>>>> What if we use a list and a flag called 'emit_allowed' under a
>>>>>>> lock, such that in amdgpu_fence_emit we lock the list, check the
>>>>>>> flag, and if true add the new HW fence to the list and proceed
>>>>>>> with HW emission as normal, otherwise return -ENODEV? In
>>>>>>> amdgpu_pci_remove we take the lock, set the flag to false, and
>>>>>>> then iterate the list and force signal everything on it. Will
>>>>>>> this not prevent any new HW fence creation from that point on,
>>>>>>> from any place trying to do so?
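>>>>>>>
>>>>>>> Roughly what I have in mind, just as a sketch (the lock, list and
>>>>>>> flag names are made up, and it assumes a list_head embedded in
>>>>>>> our fence struct):
>>>>>>>
>>>>>>> /* In amdgpu_fence_emit(), before touching the HW: */
>>>>>>> spin_lock(&adev->fence_list_lock);
>>>>>>> if (!adev->emit_allowed) {
>>>>>>>     spin_unlock(&adev->fence_list_lock);
>>>>>>>     return -ENODEV;
>>>>>>> }
>>>>>>> list_add_tail(&fence->node, &adev->fence_list);
>>>>>>> spin_unlock(&adev->fence_list_lock);
>>>>>>> /* ... proceed with normal HW emission ... */
>>>>>>>
>>>>>>> /* In amdgpu_pci_remove(): block new emissions, then force
>>>>>>>  * signal everything still on the list. */
>>>>>>> spin_lock(&adev->fence_list_lock);
>>>>>>> adev->emit_allowed = false;
>>>>>>> list_for_each_entry(fence, &adev->fence_list, node)
>>>>>>>     dma_fence_signal(&fence->base);
>>>>>>> spin_unlock(&adev->fence_list_lock);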
>>>>>>
>>>>>> Way too much overhead. The fence processing is intentionally
>>>>>> lock-free to avoid cache line bouncing, because the IRQ can move
>>>>>> from CPU to CPU.
>>>>>>
>>>>>> We need something which doesn't affect the processing of fences
>>>>>> in the interrupt handler at all.
>>>>>
>>>>>
>>>>> As far as I see in the code, amdgpu_fence_emit is only called from 
>>>>> task context. Also, we can skip this list I proposed and just use 
>>>>> amdgpu_fence_driver_force_completion for each ring to signal all 
>>>>> created HW fences.
>>>>
>>>> Ah, wait a second this gave me another idea.
>>>>
>>>> See amdgpu_fence_driver_force_completion():
>>>>
>>>> amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>>>>
>>>> If we change that to something like:
>>>>
>>>> amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFFFFFF);
>>>>
>>>> Not only the currently submitted, but also the next 0x3FFFFFFF 
>>>> fences will be considered signaled.
>>>>
>>>> This basically solves our problem of making sure that new fences
>>>> are also signaled, without any additional overhead whatsoever.
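>>>>
>>>> I.e. roughly (untested, the constant is just an example well inside
>>>> the 32-bit wrap-around window):
>>>>
>>>> void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring)
>>>> {
>>>>     /* Not just the last emitted seqno: jump far ahead of sync_seq
>>>>      * so that fences emitted after this point are considered
>>>>      * signaled as well. */
>>>>     amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFFFFFF);
>>>> }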
>>>
>>>
>>> Problem with this is that the act of setting the sync_seq to some
>>> MAX value alone is not enough; you actually have to call
>>> amdgpu_fence_process to iterate and signal the fences currently
>>> stored in the ring->fence_drv.fences array, and to guarantee that
>>> once you are done signaling, no more HW fences will be added to that
>>> array. I was thinking of doing something like below:
>>>
>>
>> Well we could implement the is_signaled callback once more, but I'm 
>> not sure if that is a good idea.
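>>
>> Something like this maybe (from memory, simplified and ignoring the
>> seqno wrap-around handling a real implementation would need):
>>
>> static bool amdgpu_fence_is_signaled(struct dma_fence *f)
>> {
>>     struct amdgpu_fence *fence = to_amdgpu_fence(f);
>>
>>     /* With the last signaled seqno pushed far ahead on unplug this
>>      * reports new fences as signaled without explicitly signaling
>>      * each of them. */
>>     return amdgpu_fence_read(fence->ring) >= f->seqno;
>> }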
>
>
> This indeed could save the explicit signaling I am doing below, but I 
> also set an error code there, which might be helpful to propagate to 
> users.
>
>
>>
>>> amdgpu_fence_emit()
>>> {
>>>     dma_fence_init(fence);
>>>
>>>     srcu_read_lock(amdgpu_unplug_srcu);
>>>
>>>     if (!adev->unplug) {
>>>
>>>         seq = ++ring->fence_drv.sync_seq;
>>>         emit_fence(fence);
>>>
>>>         /* We can't wait forever as the HW might be gone at any point */
>>>         dma_fence_wait_timeout(old_fence, false, 5 * HZ);
>>>
>>
>> You can pretty much ignore this wait here. It is only as a last 
>> resort so that we never overwrite the ring buffers.
>
>
> If the device is present, how can I ignore this?
>
>
>>
>> But it should not have a timeout as far as I can see.
>
>
> Without a timeout on the wait, the whole approach falls apart, as I 
> can't call synchronize_srcu in this scope, because once the device is 
> physically gone the wait here will last forever.
>

Yeah, but this is intentional. The only alternative to avoid corruption 
is to wait with a timeout and call BUG() if that triggers. That isn't 
much better.
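
To illustrate (just a sketch, the 5 second value is arbitrary):

    r = dma_fence_wait_timeout(old_fence, false, 5 * HZ);
    /* A timeout means the fence slot we are about to reuse is
     * potentially still in use by the HW; overwriting it would
     * corrupt the ring buffer, so all we could do is die loudly. */
    BUG_ON(r == 0);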

>
>>
>>>         ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence;
>>>
>>>     } else {
>>>
>>>         dma_fence_set_error(fence, -ENODEV);
>>>         dma_fence_signal(fence);
>>>
>>>     }
>>>
>>>     srcu_read_unlock(amdgpu_unplug_srcu);
>>>     return fence;
>>> }
>>>
>>> amdgpu_pci_remove()
>>> {
>>>     adev->unplug = true;
>>>     synchronize_srcu(amdgpu_unplug_srcu);
>>>
>>
>> Well that is just duplicating what drm_dev_unplug() should be doing 
>> on a different level.
>
>
> drm_dev_unplug has a much wider scope, covering everything in the 
> device including 'flushing' in-flight IOCTLs; this deals specifically 
> with the issue of force signaling HW fences.
>

Yeah, but it adds the same overhead as the device srcu.

Christian.

> Andrey
>
>
>>
>> Christian.
>>
>>>     /* Past this point no more fences are submitted to the HW ring,
>>>      * and hence we can safely force signal all that are currently
>>>      * there. Any subsequently created HW fences will be returned
>>>      * signaled with an error code right away.
>>>      */
>>>
>>>     for_each_ring(adev)
>>>         amdgpu_fence_process(ring);
>>>
>>>     drm_dev_unplug(dev);
>>>     Stop schedulers
>>>     cancel_sync(all timers and queued works);
>>>     hw_fini
>>>     unmap_mmio
>>>
>>> }
>>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>>>>
>>>>>>>> Alternatively grabbing the reset write side and stopping and 
>>>>>>>> then restarting the scheduler could work as well.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>
>>>>>>>
>>>>>>> I didn't get the above, and I don't see why I need to reuse the 
>>>>>>> GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
>>>>>>> Also, it's not clear to me why we are focusing on the scheduler 
>>>>>>> threads; any code path that generates HW fences should be covered, 
>>>>>>> so any code leading to amdgpu_fence_emit needs to be taken into 
>>>>>>> account, such as direct IB submissions, VM flushes, etc.
>>>>>>
>>>>>> You need to work together with the reset lock anyway, because a 
>>>>>> hotplug could run at the same time as a reset.
>>>>>
>>>>>
>>>>> Going my way, I indeed see now that I have to take the reset 
>>>>> write side lock during HW fence signaling in order to protect 
>>>>> against scheduler/HW fence detachment and reattachment during 
>>>>> scheduler stop/restart. But if we go with your approach, then 
>>>>> calling drm_dev_unplug and scoping amdgpu_job_timeout with 
>>>>> drm_dev_enter/exit should be enough to prevent any concurrent GPU 
>>>>> resets during unplug. In fact I already do it anyway - 
>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
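>>>>>
>>>>> The scoping looks roughly like this (simplified sketch of the
>>>>> pattern, not the exact patch):
>>>>>
>>>>> static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>> {
>>>>>     struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>>>     int idx;
>>>>>
>>>>>     if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
>>>>>         /* Device is gone, nothing left to reset. */
>>>>>         return DRM_GPU_SCHED_STAT_ENODEV;
>>>>>
>>>>>     /* ... normal timeout handling / GPU reset ... */
>>>>>
>>>>>     drm_dev_exit(idx);
>>>>>     return DRM_GPU_SCHED_STAT_NOMINAL;
>>>>> }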
>>>>
>>>> Yes, good point as well.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey