[PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Apr 13 18:18:23 UTC 2021
On 2021-04-13 2:03 p.m., Christian König wrote:
> Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
>>
>> On 2021-04-13 3:10 a.m., Christian König wrote:
>>> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2021-04-12 3:18 p.m., Christian König wrote:
>>>>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>>>>> [SNIP]
>>>>>>>>
>>>>>>>> So what's the right approach? How do we guarantee that when
>>>>>>>> running amdgpu_fence_driver_force_completion we signal all the
>>>>>>>> HW fences without racing against more fences being inserted
>>>>>>>> into that array?
>>>>>>>>
>>>>>>>
>>>>>>> Well I would still say the best approach would be to insert this
>>>>>>> between the front end and the backend and not rely on signaling
>>>>>>> fences while holding the device srcu.
>>>>>>
>>>>>>
>>>>>> My question is, even now, when we run
>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
>>>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>>>>>> what prevents a race with another fence being emitted and inserted
>>>>>> into the fence array at the same time? As far as I can see, nothing.
>>>>>>
>>>>>
>>>>> Each ring can only be used by one thread at a time; this includes
>>>>> emitting fences as well as other stuff.
>>>>>
>>>>> During GPU reset we make sure nobody writes to the rings by
>>>>> stopping the scheduler and taking the GPU reset lock (so that
>>>>> nobody else can start the scheduler again).
>>>>
>>>>
>>>> What about direct submissions that don't go through the scheduler,
>>>> e.g. amdgpu_job_submit_direct? I don't see how those are protected.
>>>
>>> Those only happen during startup and GPU reset.
>>
>>
>> Ok, but then it looks like I am missing something; see the following
>> steps in amdgpu_pci_remove -
>>
>> 1) Use the disable_irq API to stop and flush all in-flight HW
>> interrupt handlers
>>
>> 2) Grab the reset lock and stop all the schedulers
>>
>> After the above 2 steps the HW fences array is idle, with no more
>> insertions and no more extractions from the array
>>
>> 3) Run amdgpu_fence_process once to signal all current HW fences
>>
>> 4) Call drm_dev_unplug (which will 'flush' all in-flight IOCTLs),
>> release the GPU reset lock and go on with the rest of the sequence
>> (cancel timers, work items etc.)
>>
>> What's problematic in this sequence?
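Just to make the ordering above concrete, here is a rough sketch in C of what I mean (not actual driver code; the use of adev->reset_sem, kthread_park on the scheduler threads and disable_irq(pdev->irq) are simplifications I'm assuming for illustration only):

static void amdgpu_pci_remove_sketch(struct pci_dev *pdev)
{
    struct drm_device *ddev = pci_get_drvdata(pdev);
    struct amdgpu_device *adev = drm_to_adev(ddev);
    int i;

    /* 1) stop and flush all in-flight HW interrupt handlers */
    disable_irq(pdev->irq);

    /* 2) grab the reset lock and park all scheduler threads */
    down_write(&adev->reset_sem);
    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
        struct amdgpu_ring *ring = adev->rings[i];

        if (ring && ring->sched.thread)
            kthread_park(ring->sched.thread);
    }

    /* the HW fences array is now idle - no insertions, no extractions */

    /* 3) signal whatever HW fences are still in the array */
    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
        if (adev->rings[i])
            amdgpu_fence_process(adev->rings[i]);
    }

    /* 4) 'flush' in-flight IOCTLs, then release the reset lock */
    drm_dev_unplug(ddev);
    up_write(&adev->reset_sem);

    /* ... cancel timers and work items, hw_fini, unmap MMIO ... */
}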
>
> drm_dev_unplug() will wait for the IOCTLs to finish.
>
> The IOCTLs in turn can wait for fences. Those can be hardware fences,
> scheduler fences, as well as fences from other devices (and KIQ
> fences for register writes under SRIOV, but we can hopefully ignore
> them for now).
>
> We have handled the hardware fences, but we have no idea when the
> scheduler fences or the fences from other devices will signal.
>
> Scheduler fences won't signal until the scheduler threads are
> restarted or we somehow cancel the submissions. Doable, but tricky as
> well.
For scheduler fences I am not worried: the sched_fence->finished fences
are by definition attached to HW fences which have already signaled, and
for sched_fence->scheduled we should run drm_sched_entity_kill_jobs for
each entity after stopping the scheduler threads and before calling
drm_dev_unplug.
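Roughly something like this (just a sketch; amdgpu_for_each_entity() is a made-up placeholder for walking all entities of all contexts of the device, and it assumes drm_sched_entity_kill_jobs(), which is currently static in the scheduler, gets exported or an equivalent helper is added):

/* Flush scheduler fences before drm_dev_unplug(); the scheduler
 * threads are already stopped at this point, so no new jobs can be
 * pushed while we walk the entities.
 */
static void amdgpu_kill_pending_jobs(struct amdgpu_device *adev)
{
    struct drm_sched_entity *entity;

    amdgpu_for_each_entity(adev, entity)    /* hypothetical iterator */
        drm_sched_entity_kill_jobs(entity);
}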
>
> As for waiting on other devices, I have no idea whether that couldn't
> deadlock somehow.
Yeah, not sure about imported fences and dma-bufs. I would assume the
other devices should not be impacted by our device removal, but who
knows...
So I guess we are NOT going to finalize HW fences before drm_dev_unplug
and will instead just wrap the MMIO accesses in the back-ends with
drm_dev_enter/exit all over the place?
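i.e. something like the usual pattern below in every backend function that touches registers (the function and register names here are made up; only the drm_dev_enter/drm_dev_exit part is the point):

static void example_ring_set_wptr(struct amdgpu_ring *ring)
{
    int idx;

    /* skip the MMIO access if the device is already unplugged */
    if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
        return;

    WREG32(EXAMPLE_WPTR_REG, lower_32_bits(ring->wptr)); /* made-up register */

    drm_dev_exit(idx);
}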
Andrey
>
> Regards,
> Christian.
>
>>
>> Andrey
>>
>>
>>>
>>>
>>>>>
>>>>>>>
>>>>>>> BTW: Could it be that the device SRCU protects more than one
>>>>>>> device and we deadlock because of this?
>>>>>>
>>>>>>
>>>>>> I haven't actually experienced any deadlock so far but, yes,
>>>>>> drm_unplug_srcu is defined as static in drm_drv.c, so in the
>>>>>> presence of multiple devices from the same or different drivers
>>>>>> we are in fact dependent on all their critical sections, I guess.
>>>>>>
>>>>>
>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
>>>>> need to sync that up with Daniel and the rest of the i915 guys.
>>>>>
>>>>> IIRC we could actually have an amdgpu device in a docking station
>>>>> which needs hotplug and the driver might depend on waiting for the
>>>>> i915 driver as well.
>>>>
>>>>
>>>> Can't we propose a patch to make drm_unplug_srcu per drm_device? I
>>>> don't see why it has to be global rather than a per-device thing.
>>>
>>> I'm really wondering the same thing for quite a while now.
>>>
>>> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is
>>> global.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>     /* Past this point no more fences are submitted to the
>>>>>>>>>>>>      * HW ring and hence we can safely force-signal all that
>>>>>>>>>>>>      * are currently there.
>>>>>>>>>>>>      * Any subsequently created HW fences will be returned
>>>>>>>>>>>>      * signaled with an error code right away
>>>>>>>>>>>>      */
>>>>>>>>>>>>
>>>>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>>>>>
>>>>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>>>>     Stop schedulers
>>>>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>>>>     hw_fini
>>>>>>>>>>>>     unmap_mmio
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side and
>>>>>>>>>>>>>>>>> stopping and then restarting the scheduler could work
>>>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I didn't get the above and I don't see why I need to
>>>>>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU unplug
>>>>>>>>>>>>>>>> flag for unplug. Also, it's not clear to me why we are
>>>>>>>>>>>>>>>> focusing on the scheduler threads; any code path that
>>>>>>>>>>>>>>>> generates HW fences should be covered, so any code
>>>>>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken into
>>>>>>>>>>>>>>>> account, such as direct IB submissions, VM flushes etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You need to work together with the reset lock anyway,
>>>>>>>>>>>>>>> cause a hotplug could run at the same time as a reset.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I go my way, I now see that I indeed have to take the
>>>>>>>>>>>>>> reset write side lock during HW fence signalling in order
>>>>>>>>>>>>>> to protect against scheduler/HW fence detachment and
>>>>>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
>>>>>>>>>>>>>> with your approach then calling drm_dev_unplug and
>>>>>>>>>>>>>> scoping amdgpu_job_timeout with drm_dev_enter/exit should
>>>>>>>>>>>>>> be enough to prevent any concurrent GPU resets during
>>>>>>>>>>>>>> unplug. In fact I already do it anyway -
>>>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>