[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König ckoenig.leichtzumerken at gmail.com
Tue Apr 13 07:10:28 UTC 2021


Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>
> On 2021-04-12 3:18 p.m., Christian König wrote:
>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>> [SNIP]
>>>>>
>>>>> So what's the right approach? How do we guarantee that when 
>>>>> running amdgpu_fence_driver_force_completion we signal all the HW 
>>>>> fences without racing against more fences being inserted into that 
>>>>> array at the same time?
>>>>>
>>>>>
>>>>
>>>> Well, I would still say the best approach would be to insert this 
>>>> between the front end and the back end, and not to rely on signaling 
>>>> fences while holding the device SRCU.
>>>
>>>
>>> My question is: even now, when we run 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
>>> what prevents a race with another fence being emitted and inserted 
>>> into the fence array at the same time? Looks like nothing.
>>>
>>
>> Each ring can only be used by one thread at a time; this includes 
>> emitting fences as well as other operations.
>>
>> During GPU reset we make sure nobody writes to the rings by stopping 
>> the scheduler and taking the GPU reset lock (so that nobody else can 
>> start the scheduler again).
>
>
> What about direct submissions that bypass the scheduler, such as 
> amdgpu_job_submit_direct? I don't see how those are protected.

Those only happen during startup and GPU reset.
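The rule sketched above (normal submitters hold the reset lock for read, GPU reset holds it for write and may then submit directly) can be illustrated with a small userspace model. This is only a sketch of the mutual exclusion being described, not the real amdgpu code; all names here are invented, and the real driver additionally parks the scheduler threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the device: readers/reset_active model the read and
 * write sides of the GPU reset lock. */
struct model_adev {
    int  readers;       /* read-side holders of the reset lock */
    bool reset_active;  /* write side held */
    int  fences_emitted;
};

/* Normal job submission: runs under the read side of the reset lock,
 * so it can never overlap a reset. */
void model_submit_job(struct model_adev *adev)
{
    assert(!adev->reset_active);  /* rdlock would block during a reset */
    adev->readers++;
    adev->fences_emitted++;       /* fence emission happens here */
    adev->readers--;
}

/* GPU reset: takes the write side, excluding all submitters, and may
 * then perform direct submissions safely. */
void model_gpu_reset(struct model_adev *adev)
{
    assert(adev->readers == 0);   /* wrlock waits for readers to drain */
    adev->reset_active = true;
    adev->fences_emitted++;       /* a direct submission during reset */
    adev->reset_active = false;
}
```

With this shape, direct submissions only ever run while the write side is held (startup and reset), which is why they cannot race with scheduler submissions.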

>>
>>>>
>>>> BTW: Could it be that the device SRCU protects more than one device 
>>>> and we deadlock because of this?
>>>
>>>
>>> I haven't actually experienced any deadlock so far, but yes, 
>>> drm_unplug_srcu is defined as static in drm_drv.c, so in the 
>>> presence of multiple devices from the same or different drivers we 
>>> are in fact dependent on all of their critical sections, I guess.
>>>
>>>
>>
>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
>> need to sync that up with Daniel and the rest of the i915 guys.
>>
>> IIRC we could actually have an amdgpu device in a docking station 
>> which needs hotplug and the driver might depend on waiting for the 
>> i915 driver as well.
>
>
> Can't we propose a patch to make drm_unplug_srcu per drm_device? I 
> don't see why it has to be global rather than a per-device thing.

I've really been wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.
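For reference, the per-device semantics being proposed could be modelled like this. This is a toy, single-threaded stand-in for what SRCU provides (the real drm_dev_enter()/drm_dev_unplug() in drm_drv.c use srcu_read_lock() and synchronize_srcu()); the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-device model: a reader count plus an unplugged flag
 * stand in for the SRCU read-side critical sections. */
struct model_dev {
    bool unplugged;
    int  readers;   /* models outstanding drm_dev_enter() sections */
};

/* drm_dev_enter() analogue: enter a critical section iff the device
 * is still alive. */
bool model_dev_enter(struct model_dev *dev)
{
    if (dev->unplugged)
        return false;
    dev->readers++;
    return true;
}

/* drm_dev_exit() analogue. */
void model_dev_exit(struct model_dev *dev)
{
    dev->readers--;
}

/* drm_dev_unplug() analogue: mark the device gone. The real code would
 * then synchronize_srcu() to wait for readers to drain; with per-device
 * state that wait depends only on *this* device's readers, not on every
 * DRM device in the system as with the global drm_unplug_srcu. */
void model_dev_unplug(struct model_dev *dev)
{
    dev->unplugged = true;
    /* single-threaded model: nothing to wait for here */
}
```

The point of the model is the last comment: making the SRCU per-device would scope the unplug grace period to one device instead of coupling, e.g., an amdgpu hotplug to i915 critical sections.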

Regards,
Christian.

>
> Andrey
>
>
>>
>> Christian.
>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>     /* Past this point no more fences are submitted to the HW 
>>>>>>>>> ring, and hence we can safely force-signal all that are 
>>>>>>>>> currently there.
>>>>>>>>>      * Any subsequently created HW fences will be returned 
>>>>>>>>> signaled with an error code right away.
>>>>>>>>>      */
>>>>>>>>>
>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>>
>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>     Stop schedulers
>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>     hw_fini
>>>>>>>>>     unmap_mmio
>>>>>>>>>
>>>>>>>>> }
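The ordering sketched in the quoted pseudocode (force-signal everything already in the fence array, then have later fence creation return an already-signalled fence carrying an error) can be modelled in a few lines. All names here are illustrative stand-ins, not the real amdgpu structures.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_RING_SIZE 8

struct model_fence {
    bool signaled;
    int  error;
};

struct model_ring {
    struct model_fence fences[MODEL_RING_SIZE];
    int  count;
    bool no_new_fences;   /* set once teardown has started */
};

/* Analogue of force-signalling via amdgpu_fence_process(): mark every
 * fence currently in the array as signalled and close the ring to new
 * HW fences. */
void model_force_completion(struct model_ring *ring)
{
    ring->no_new_fences = true;
    for (int i = 0; i < ring->count; i++)
        ring->fences[i].signaled = true;
}

/* Analogue of amdgpu_fence_emit(): after teardown has started, return
 * a fence that is already signalled and carries an error code instead
 * of inserting anything into the array. */
struct model_fence model_fence_emit(struct model_ring *ring)
{
    struct model_fence f = { false, 0 };

    if (ring->no_new_fences) {
        f.signaled = true;
        f.error = -19;  /* -ENODEV in the kernel */
        return f;
    }
    ring->fences[ring->count++] = f;
    return f;
}
```

This is exactly the invariant the comment in the pseudocode states: once the flag is set, nothing new ever reaches the array, so force completion cannot race with insertion.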
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping 
>>>>>>>>>>>>>> and then restarting the scheduler could work as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I didn't get the above, and I don't see why I need to 
>>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU unplug 
>>>>>>>>>>>>> flag for unplug. Also, it's not clear to me why we are 
>>>>>>>>>>>>> focusing on the scheduler threads; any code path that 
>>>>>>>>>>>>> generates HW fences should be covered, so any code leading 
>>>>>>>>>>>>> to amdgpu_fence_emit needs to be taken into account, such 
>>>>>>>>>>>>> as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> You need to work together with the reset lock anyway, 
>>>>>>>>>>>> because a hotplug could run at the same time as a reset.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Going my way, I indeed see now that I have to take the 
>>>>>>>>>>> reset write side lock during HW fence signalling in order 
>>>>>>>>>>> to protect against scheduler/HW fence detachment and 
>>>>>>>>>>> reattachment during scheduler stop/restart. But if we go 
>>>>>>>>>>> with your approach, then calling drm_dev_unplug and scoping 
>>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough 
>>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact 
>>>>>>>>>>> I already do it anyway - 
>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>
>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>


