[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Andrey Grodzovsky andrey.grodzovsky at amd.com
Tue Apr 13 18:30:39 UTC 2021


On 2021-04-13 2:25 p.m., Christian König wrote:
>
>
> Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:
>>
>> On 2021-04-13 2:03 p.m., Christian König wrote:
>>> Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2021-04-13 3:10 a.m., Christian König wrote:
>>>>> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 2021-04-12 3:18 p.m., Christian König wrote:
>>>>>>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> So what's the right approach? How do we guarantee that when 
>>>>>>>>>> running amdgpu_fence_driver_force_completion we signal all the 
>>>>>>>>>> HW fences without racing against more fences being inserted 
>>>>>>>>>> into that array?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well I would still say the best approach would be to insert 
>>>>>>>>> this between the front end and the backend and not rely on 
>>>>>>>>> signaling fences while holding the device srcu.
>>>>>>>>
>>>>>>>>
>>>>>>>> My question is: even now, when we run 
>>>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>>>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
>>>>>>>> what prevents a race with another fence being emitted and 
>>>>>>>> inserted into the fence array at the same time? Looks like 
>>>>>>>> nothing.
>>>>>>>>
>>>>>>>
>>>>>>> Each ring can only be used by one thread at a time; this 
>>>>>>> includes emitting fences as well as other things.
>>>>>>>
>>>>>>> During GPU reset we make sure nobody writes to the rings by 
>>>>>>> stopping the scheduler and taking the GPU reset lock (so that 
>>>>>>> nobody else can start the scheduler again).
>>>>>>
>>>>>>
>>>>>> What about direct submissions that don't go through the scheduler, 
>>>>>> such as amdgpu_job_submit_direct? I don't see how those are protected.
>>>>>
>>>>> Those only happen during startup and GPU reset.
>>>>
>>>>
>>>> Ok, but then it looks like I am missing something; see the following 
>>>> steps in amdgpu_pci_remove:
>>>>
>>>> 1) Use the disable_irq() API to stop and flush all in-flight HW 
>>>> interrupt handlers
>>>>
>>>> 2) Grab the reset lock and stop all the schedulers
>>>>
>>>> After the above 2 steps the HW fence array is idle: no more insertions 
>>>> and no more extractions from the array
>>>>
>>>> 3) Run amdgpu_fence_process once to signal all current HW fences
>>>>
>>>> 4) Call drm_dev_unplug (which will 'flush' all in-flight IOCTLs), 
>>>> release the GPU reset lock and go on with the rest of the sequence 
>>>> (cancel timers, work items, etc.)
>>>>
>>>> What's problematic in this sequence?
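
Restating steps 1-4 above as pseudocode, in the same style as the sequence
quoted further down in this thread (just a sketch; the reset-lock and
scheduler lines are descriptive placeholders, not existing helpers):

    amdgpu_pci_remove() {
        /* 1) stop and flush all in-flight HW interrupt handlers */
        disable_irq(...)

        /* 2) no more fence insertions or extractions past this point */
        take GPU reset lock (write side)
        stop all schedulers

        /* 3) signal everything currently in the HW fence array */
        for_each_ring(adev)
            amdgpu_fence_process(ring)

        /* 4) 'flush' in-flight IOCTLs, then finish the teardown */
        drm_dev_unplug(dev)
        release GPU reset lock
        cancel_sync(all timers and queued works)
        hw_fini
        unmap_mmio
    }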
>>>
>>> drm_dev_unplug() will wait for the IOCTLs to finish.
>>>
>>> The IOCTLs in turn can wait for fences. These can be hardware 
>>> fences, scheduler fences, or fences from other devices (and 
>>> KIQ fences for register writes under SRIOV, but we can hopefully 
>>> ignore those for now).
>>>
>>> We have handled the hardware fences, but we have no idea when the 
>>> scheduler fences or the fences from other devices will signal.
>>>
>>> Scheduler fences won't signal until the scheduler threads are 
>>> restarted or we somehow cancel the submissions. Doable, but tricky 
>>> as well.
>>
>>
>> For scheduler fences I am not worried: the sched_fence->finished 
>> fences are by definition attached to HW fences which have already 
>> signaled. For sched_fence->scheduled we should run 
>> drm_sched_entity_kill_jobs for each entity after stopping the 
>> scheduler threads and before calling drm_dev_unplug.
>
> Well, exactly that is what is tricky here. drm_sched_entity_kill_jobs() 
> assumes that no more jobs are being pushed into the entity.
>
> We are racing here once more and need to handle that.


But why? As I wrote above, we first stop all the schedulers and only then 
call drm_sched_entity_kill_jobs.

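In other words, the intended ordering is (sketch only; drm_sched_entity_kill_jobs()
is currently a static helper inside drm_sched_entity.c, so it would need to be
exported or wrapped, and iterating all entities of the device is hand-waved here):

    stop all scheduler threads                /* nothing can push new jobs into the entities anymore */
    for each entity belonging to this device
        drm_sched_entity_kill_jobs(entity)    /* flush out the jobs still queued in the entity */
    drm_dev_unplug(dev)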

>
>>>
>>> As for waiting for other devices, I have no idea whether that 
>>> couldn't somehow deadlock.
>>
>>
>> Yeah, not sure about imported fences and dma_bufs. I would assume the 
>> other devices should not be impacted by our device removal but, who 
>> knows...
>>
>> So I guess we are NOT going with finalizing HW fences before 
>> drm_dev_unplug and will instead just call drm_dev_enter/exit in the 
>> back-ends all over the place where there are MMIO accesses?
>
> Good question. As you said that is really the hard path.
>
> Handling it all at once at IOCTL level certainly has some appeal as 
> well, but I have no idea if we can guarantee that this is lock free.


Maybe just empirically - let's try it and see what actually happens 
under different test scenarios?
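
If we do go the drm_dev_enter/exit route, the pattern at each MMIO touch
point in the back-ends would be roughly the following (the function and
register below are made up for illustration; only drm_dev_enter()/
drm_dev_exit() and adev_to_drm() are the real interfaces):

    static void some_backend_mmio_op(struct amdgpu_ring *ring)
    {
            int idx;

            /* Bail out once the device is unplugged; on success
             * drm_dev_enter() holds the unplug SRCU read lock.
             */
            if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
                    return;

            WREG32(some_reg, some_val);     /* the actual MMIO access */

            drm_dev_exit(idx);
    }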

Andrey


>
> Christian.
>
>>
>> Andrey
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: Could it be that the device SRCU protects more than one 
>>>>>>>>> device and we deadlock because of this?
>>>>>>>>
>>>>>>>>
>>>>>>>> I haven't actually experienced any deadlock so far but, yes, 
>>>>>>>> drm_unplug_srcu is defined as static in drm_drv.c, so in the 
>>>>>>>> presence of multiple devices from the same or different drivers 
>>>>>>>> we are in fact dependent on all of their critical sections, I guess.
>>>>>>>>
>>>>>>>
>>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we 
>>>>>>> actually need to sync that up with Daniel and the rest of the 
>>>>>>> i915 guys.
>>>>>>>
>>>>>>> IIRC we could actually have an amdgpu device in a docking 
>>>>>>> station which needs hotplug and the driver might depend on 
>>>>>>> waiting for the i915 driver as well.
>>>>>>
>>>>>>
>>>>>> Can't we propose a patch to make drm_unplug_srcu per drm_device? 
>>>>>> I don't see why it has to be global rather than a per-device thing.
>>>>>
>>>>> I've really been wondering the same thing for quite a while now.
>>>>>
>>>>> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
>>>>> global.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>     /* Past this point no more fences are submitted to the 
>>>>>>>>>>>>>> HW ring, hence we can safely force-signal all that are 
>>>>>>>>>>>>>> currently there.
>>>>>>>>>>>>>>      * Any subsequently created HW fences will be returned 
>>>>>>>>>>>>>> signaled with an error code right away.
>>>>>>>>>>>>>>      */
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>>>>>>     Stop schedulers
>>>>>>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>>>>>>     hw_fini
>>>>>>>>>>>>>>     unmap_mmio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side and 
>>>>>>>>>>>>>>>>>>> stopping and then restarting the scheduler could 
>>>>>>>>>>>>>>>>>>> work as well.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I didn't get the above and I don't see why I need to 
>>>>>>>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU 
>>>>>>>>>>>>>>>>>> unplug flag for unplug. Also, it is not clear to me why 
>>>>>>>>>>>>>>>>>> we are focusing on the scheduler threads: any code path 
>>>>>>>>>>>>>>>>>> that generates HW fences should be covered, so any code 
>>>>>>>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken into 
>>>>>>>>>>>>>>>>>> account, such as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You need to work together with the reset lock anyway, 
>>>>>>>>>>>>>>>>> because a hotplug could run at the same time as a reset.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Going my way, I indeed now see that I have to take the 
>>>>>>>>>>>>>>>> reset write-side lock during HW fence signalling in order 
>>>>>>>>>>>>>>>> to protect against scheduler/HW fence detachment and 
>>>>>>>>>>>>>>>> reattachment during scheduler stop/restart. But if we go 
>>>>>>>>>>>>>>>> with your approach, then calling drm_dev_unplug and 
>>>>>>>>>>>>>>>> scoping amdgpu_job_timeout with drm_dev_enter/exit should 
>>>>>>>>>>>>>>>> be enough to prevent any concurrent GPU resets during 
>>>>>>>>>>>>>>>> unplug. In fact I already do it anyway - 
>>>>>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andrey

