[PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Mon Apr 12 20:01:07 UTC 2021
On 2021-04-12 3:18 p.m., Christian König wrote:
> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>> [SNIP]
>>>>
>>>> So what's the right approach? How do we guarantee that when
>>>> running amdgpu_fence_driver_force_completion we signal all the HW
>>>> fences without racing against more fences being inserted into
>>>> that array?
>>>>
>>>
>>> Well, I would still say the best approach would be to insert this
>>> between the front end and the back end and not rely on signaling
>>> fences while holding the device SRCU.
>>
>>
>> My question is: even now, when we run
>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>> what prevents a race with another fence being emitted and inserted
>> into the fence array at the same time? It looks like nothing does.
>>
>
> Each ring can only be used by one thread at a time; this includes
> emitting fences as well as other operations.
>
> During GPU reset we make sure nobody writes to the rings by stopping
> the scheduler and taking the GPU reset lock (so that nobody else can
> start the scheduler again).
What about direct submissions that bypass the scheduler, such as
amdgpu_job_submit_direct? I don't see how those are protected.
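
For reference, my understanding of that reset-side protection, as a
simplified sketch (not the literal upstream code; the lock and field
names such as adev->reset_sem vary between kernel versions):

static void amdgpu_sketch_quiesce_rings(struct amdgpu_device *adev)
{
	int i;

	/* Take the GPU reset lock so nobody can restart the schedulers. */
	down_write(&adev->reset_sem);

	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->sched.thread)
			continue;

		/* Park the scheduler thread: no new jobs reach the ring,
		 * hence no new HW fences are emitted from this path. */
		drm_sched_stop(&ring->sched, NULL);
	}

	/* From here on, only the reset path itself may touch the rings. */
}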
>
>>>
>>> BTW: Could it be that the device SRCU protects more than one device
>>> and we deadlock because of this?
>>
>>
>> I haven't actually experienced any deadlock so far but, yes,
>> drm_unplug_srcu is defined as static in drm_drv.c, so in the
>> presence of multiple devices, from the same or different drivers,
>> we are in fact dependent on all of their critical sections, I guess.
>>
>
> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> need to sync that up with Daniel and the rest of the i915 guys.
>
> IIRC we could actually have an amdgpu device in a docking station
> which needs hotplug and the driver might depend on waiting for the
> i915 driver as well.
Can't we propose a patch to make drm_unplug_srcu a per-drm_device
thing? I don't see why it has to be global rather than per device.
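
Something along these lines, as a rough sketch (the unplug_srcu field
in struct drm_device is hypothetical; drm_dev_exit would have to grow
a dev parameter, and drm_dev_unplug would do
synchronize_srcu(&dev->unplug_srcu)):

bool drm_dev_enter(struct drm_device *dev, int *idx)
{
	/* Per-device SRCU instead of the file-scope drm_unplug_srcu. */
	*idx = srcu_read_lock(&dev->unplug_srcu);

	if (dev->unplugged) {
		srcu_read_unlock(&dev->unplug_srcu, *idx);
		return false;
	}

	return true;
}

void drm_dev_exit(struct drm_device *dev, int idx)
{
	srcu_read_unlock(&dev->unplug_srcu, idx);
}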
Andrey
>
> Christian.
>>>>>>>> /* Past this point no more fences are submitted to the HW
>>>>>>>>  * ring, hence we can safely force-signal all that are
>>>>>>>>  * currently there. Any subsequently created HW fences will
>>>>>>>>  * be returned signaled with an error code right away.
>>>>>>>>  */
>>>>>>>>
>>>>>>>> for_each_ring(adev)
>>>>>>>>     amdgpu_fence_process(ring)
>>>>>>>>
>>>>>>>> drm_dev_unplug(dev);
>>>>>>>> Stop schedulers
>>>>>>>> cancel_sync(all timers and queued works);
>>>>>>>> hw_fini
>>>>>>>> unmap_mmio
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
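
For clarity, the quoted sketch above in C-ish form, just to pin down
the ordering I have in mind (amdgpu_fence_driver_set_force_completion
is a hypothetical helper that would make any later amdgpu_fence_emit
return an already-signaled fence carrying an error code):

static void amdgpu_sketch_unplug(struct amdgpu_device *adev)
{
	int i;

	/* No more fences are submitted to the HW rings past this
	 * point, so force-signal everything currently in flight. */
	amdgpu_fence_driver_set_force_completion(adev); /* hypothetical */

	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring)
			continue;
		amdgpu_fence_process(ring);
	}

	drm_dev_unplug(adev_to_drm(adev));

	/* Stop schedulers */
	/* cancel_sync(all timers and queued works) */
	/* hw_fini, unmap_mmio */
}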
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping
>>>>>>>>>>>>> and then restarting the scheduler could work as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't get the above, and I don't see why I need to
>>>>>>>>>>>> reuse the GPU reset rw_lock; I rely on the SRCU unplug
>>>>>>>>>>>> flag for unplug. Also, it's not clear to me why we are
>>>>>>>>>>>> focusing on the scheduler threads; any code path that
>>>>>>>>>>>> generates HW fences should be covered, so any code
>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken into
>>>>>>>>>>>> account, such as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>
>>>>>>>>>>> You need to work together with the reset lock anyway,
>>>>>>>>>>> because a hotplug could run at the same time as a reset.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Going my way, I see now that I indeed have to take the
>>>>>>>>>> reset write-side lock while signalling the HW fences, in
>>>>>>>>>> order to protect against scheduler/HW fence detachment and
>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
>>>>>>>>>> with your approach, then calling drm_dev_unplug and scoping
>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough
>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact
>>>>>>>>>> I already do it anyway -
>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
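
For reference, the scoping in that commit looks roughly like this (a
simplified sketch, details trimmed):

static enum drm_gpu_sched_stat
amdgpu_job_timedout(struct drm_sched_job *s_job)
{
	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
	int idx;

	if (!drm_dev_enter(adev_to_drm(ring->adev), &idx)) {
		/* Device is already unplugged; skip the GPU reset. */
		return DRM_GPU_SCHED_STAT_ENODEV;
	}

	/* ... normal timeout handling / amdgpu_device_gpu_recover() ... */

	drm_dev_exit(idx);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}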
>>>>>>>>>
>>>>>>>>> Yes, good point as well.
>>>>>>>>>
>>>>>>>>> Christian.