[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König christian.koenig at amd.com
Tue Apr 13 18:25:09 UTC 2021



On 13.04.21 at 20:18, Andrey Grodzovsky wrote:
>
> On 2021-04-13 2:03 p.m., Christian König wrote:
>> On 13.04.21 at 17:12, Andrey Grodzovsky wrote:
>>>
>>> On 2021-04-13 3:10 a.m., Christian König wrote:
>>>> On 12.04.21 at 22:01, Andrey Grodzovsky wrote:
>>>>>
>>>>> On 2021-04-12 3:18 p.m., Christian König wrote:
>>>>>> On 12.04.21 at 21:12, Andrey Grodzovsky wrote:
>>>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> So what's the right approach? How do we guarantee that when 
>>>>>>>>> running amdgpu_fence_driver_force_completion we signal all the 
>>>>>>>>> HW fences without racing against more fences being inserted 
>>>>>>>>> into that array?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, I would still say the best approach would be to insert 
>>>>>>>> this between the front-end and the back-end and not rely on 
>>>>>>>> signaling fences while holding the device SRCU.
>>>>>>>
>>>>>>>
>>>>>>> My question is: even now, when we run 
>>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
>>>>>>> what prevents a race with another fence being emitted and 
>>>>>>> inserted into the fence array at the same time? Looks like 
>>>>>>> nothing.
>>>>>>>
>>>>>>
>>>>>> Each ring can only be used by one thread at a time; this 
>>>>>> includes emitting fences as well as other stuff.
>>>>>>
>>>>>> During GPU reset we make sure nobody writes to the rings by 
>>>>>> stopping the scheduler and taking the GPU reset lock (so that 
>>>>>> nobody else can start the scheduler again).
>>>>>
>>>>>
>>>>> What about direct submissions that don't go through the scheduler, 
>>>>> like amdgpu_job_submit_direct? I don't see how those are protected.
>>>>
>>>> Those only happen during startup and GPU reset.
>>>
>>>
>>> Ok, but then it looks like I am missing something; see the following 
>>> steps in amdgpu_pci_remove:
>>>
>>> 1) Use the disable_irq() API to stop and flush all in-flight HW 
>>> interrupt handlers
>>>
>>> 2) Grab the reset lock and stop all the schedulers
>>>
>>> After the above 2 steps the HW fence array is idle: no more 
>>> insertions and no more extractions from the array
>>>
>>> 3) Run amdgpu_fence_process once to signal all current HW fences
>>>
>>> 4) Call drm_dev_unplug (which will 'flush' all in-flight IOCTLs), 
>>> release the GPU reset lock and go on with the rest of the sequence 
>>> (cancel timers, work items, etc.)
>>>
>>> What's problematic in this sequence?
>>
>> drm_dev_unplug() will wait for the IOCTLs to finish.
>>
>> The IOCTLs in turn can wait for fences. Those can be hardware 
>> fences, scheduler fences, as well as fences from other devices (and 
>> KIQ fences for register writes under SRIOV, but we can hopefully 
>> ignore those for now).
>>
>> We have handled the hardware fences, but we have no idea when the 
>> scheduler fences or the fences from other devices will signal.
>>
>> Scheduler fences won't signal until the scheduler threads are 
>> restarted or we somehow cancel the submissions. Doable, but tricky as 
>> well.
>
>
> For scheduler fences I am not worried: the sched_fence->finished 
> fences are by definition attached to HW fences which have already 
> signaled. For sched_fence->scheduled we should run 
> drm_sched_entity_kill_jobs for each entity after stopping the 
> scheduler threads and before calling drm_dev_unplug.

Well, exactly that is the tricky part here. drm_sched_entity_kill_jobs() 
assumes that no more jobs are being pushed into the entity.

We are racing here once more and need to handle that.
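
Just to make the ordering problem explicit, it would have to look 
roughly like this (completely untested sketch; the helper name is made 
up, the error code is arbitrary, and the entity->stopped check in 
drm_sched_entity_push_job() is an assumption, not something the 
scheduler guarantees today):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Illustration only: submission must be blocked *before* the queue is
 * drained, otherwise a job pushed in between is silently lost.
 * Dependency handling and job freeing are omitted here.
 */
static void drain_entity_sketch(struct drm_sched_entity *entity)
{
        struct spsc_node *node;

        /* 1) Stop new pushes first; this is exactly the step we are
         *    racing on today.
         */
        spin_lock(&entity->rq_lock);
        entity->stopped = true;
        spin_unlock(&entity->rq_lock);

        /* 2) Only now is it safe to signal whatever is still queued. */
        while ((node = spsc_queue_pop(&entity->job_queue))) {
                struct drm_sched_job *job =
                        container_of(node, struct drm_sched_job,
                                     queue_node);

                dma_fence_set_error(&job->s_fence->finished, -ENODEV);
                dma_fence_signal(&job->s_fence->finished);
        }
}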

>>
>> For waiting on other devices I have no idea if that couldn't 
>> deadlock somehow.
>
>
> Yeah, not sure about imported fences and dma-bufs; I would assume the 
> other devices should not be impacted by our device removal but, who 
> knows...
>
> So I guess we are NOT going with finalizing HW fences before 
> drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
> back-ends all over the place where there are MMIO accesses?

Good question. As you said, that is really the hard path.
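
The per-backend variant boils down to bracketing every single hardware 
access, something like this (the helper is made up, just to show the 
pattern):

#include <drm/drm_drv.h>
#include "amdgpu.h"

/* Made-up example, only to show the pattern for one MMIO access. */
static void example_reg_write(struct amdgpu_device *adev, u32 reg, u32 v)
{
        int idx;

        /* drm_dev_enter() returns false once drm_dev_unplug() was
         * called, so the access is simply skipped after the device is
         * gone.
         */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;

        WREG32(reg, v);         /* MMIO only touched inside the bracket */

        drm_dev_exit(idx);
}

Doable, but it has to be sprinkled over basically every code path that 
touches registers.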

Handling it all at once at the IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock-free.
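
What I mean by IOCTL level is essentially one big bracket like this 
(hypothetical wrapper, not existing code), and that bracket is exactly 
where the locking question comes from:

#include <drm/drm_drv.h>
#include <drm/drm_file.h>

/* Hypothetical: protect a whole IOCTL handler with the unplug SRCU
 * instead of every single MMIO access in the back-ends.
 */
static int ioctl_wrapped_sketch(struct drm_device *dev, void *data,
                                struct drm_file *filp,
                                int (*handler)(struct drm_device *,
                                               void *, struct drm_file *))
{
        int idx, r;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        /* If the handler blocks here on a fence that never signals,
         * drm_dev_unplug() blocks in synchronize_srcu() as well.
         */
        r = handler(dev, data, filp);

        drm_dev_exit(idx);
        return r;
}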

Christian.

>
> Andrey
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> BTW: Could it be that the device SRCU protects more than one 
>>>>>>>> device and we deadlock because of this?
>>>>>>>
>>>>>>>
>>>>>>> I haven't actually experienced any deadlock so far but, yes, 
>>>>>>> drm_unplug_srcu is defined as static in drm_drv.c, and so in the 
>>>>>>> presence of multiple devices from the same or different drivers 
>>>>>>> we are in fact dependent on all their critical sections, I guess.
>>>>>>>
>>>>>>
>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we 
>>>>>> actually need to sync that up with Daniel and the rest of the 
>>>>>> i915 guys.
>>>>>>
>>>>>> IIRC we could actually have an amdgpu device in a docking station 
>>>>>> which needs hotplug and the driver might depend on waiting for 
>>>>>> the i915 driver as well.
>>>>>
>>>>>
>>>>> Can't we propose a patch to make drm_unplug_srcu per drm_device? 
>>>>> I don't see why it has to be global instead of a per-device thing.
>>>>
>>>> I'm really wondering the same thing for quite a while now.
>>>>
>>>> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
>>>> global.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>     /* Past this point no more fences are submitted to the 
>>>>>>>>>>>>> HW ring and hence we can safely force-signal all that are 
>>>>>>>>>>>>> currently there.
>>>>>>>>>>>>>      * Any subsequently created HW fences will be returned 
>>>>>>>>>>>>> signaled with an error code right away
>>>>>>>>>>>>>      */
>>>>>>>>>>>>>
>>>>>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>>>>>>
>>>>>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>>>>>     Stop schedulers
>>>>>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>>>>>     hw_fini
>>>>>>>>>>>>>     unmap_mmio
>>>>>>>>>>>>>
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side and 
>>>>>>>>>>>>>>>>>> stopping and then restarting the scheduler could work 
>>>>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I didn't get the above and I don't see why I need to 
>>>>>>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU unplug 
>>>>>>>>>>>>>>>>> flag for unplug. Also, it is not clear to me why we are 
>>>>>>>>>>>>>>>>> focusing on the scheduler threads; any code path that 
>>>>>>>>>>>>>>>>> generates HW fences should be covered, so any code 
>>>>>>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken into 
>>>>>>>>>>>>>>>>> account, such as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You need to work together with the reset lock anyway, 
>>>>>>>>>>>>>>>> because a hotplug could run at the same time as a reset.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we go my way, I indeed now see that I have to take 
>>>>>>>>>>>>>>> the reset write side lock during HW fence signalling in 
>>>>>>>>>>>>>>> order to protect against scheduler/HW fence detachment 
>>>>>>>>>>>>>>> and reattachment during scheduler stop/restart. But if 
>>>>>>>>>>>>>>> we go with your approach then calling drm_dev_unplug and 
>>>>>>>>>>>>>>> scoping amdgpu_job_timeout with drm_dev_enter/exit 
>>>>>>>>>>>>>>> should be enough to prevent any concurrent GPU resets 
>>>>>>>>>>>>>>> during unplug. In fact I already do it anyway - 
>>>>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andrey


