[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König christian.koenig at amd.com
Wed Apr 14 14:58:41 UTC 2021


Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
>  [SNIP]
>>>>
>>>> We are racing here once more and need to handle that.
>>>
>>>
>>> But why? I wrote above that we first stop all the schedulers and only 
>>> then call drm_sched_entity_kill_jobs.
>>
>> The schedulers consuming jobs is not the problem; we already handle 
>> that correctly.
>>
>> The problem is that the entities might continue feeding stuff into 
>> the scheduler.
>
>
> Missed that. OK, can I just use non-sleeping RCU with a flag around 
> drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
> it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to 
> flush and block in-flight and future submissions to the entity queue?

Double-checking the code, I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c, see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.
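
Roughly something like the sketch below - the in_unplug flag and the exact
placement are only illustrative, any flag the unplug path sets under the
same lock would do:

    /* Sketch for amdgpu_cs_submit(): we already take
     * p->adev->notifier_lock before committing the job, so a flag
     * checked under that lock is enough to fence off new submissions
     * without any additional locking.
     */
    mutex_lock(&p->adev->notifier_lock);

    if (p->adev->in_unplug) {       /* illustrative flag, set on unplug */
        r = -ENODEV;
        goto error_abort;           /* existing error path */
    }

    /* ... existing fence arming and drm_sched_entity_push_job() ... */

    mutex_unlock(&p->adev->notifier_lock);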

Christian.

>
>
>>
>>>>
>>>>>>
>>>>>> As for waiting for other devices, I have no idea whether that 
>>>>>> couldn't deadlock somehow.
>>>>>
>>>>>
>>>>> Yeah, not sure about imported fences and dma_bufs; I would assume the 
>>>>> other devices should not be impacted by our device removal, but 
>>>>> who knows...
>>>>>
>>>>> So I guess we are NOT going with finalizing HW fences before 
>>>>> drm_dev_unplug and will instead just call drm_dev_enter/exit in the 
>>>>> back-ends all over the place where there are MMIO accesses?
>>>>
>>>> Good question. As you said, that is really the hard path.
>>>>
>>>> Handling it all at once at the IOCTL level certainly has some appeal as 
>>>> well, but I have no idea if we can guarantee that this is lock-free.
>>>
>>>
>>> Maybe just empirically - let's try it and see what actually happens 
>>> under different test scenarios?
>>
>> Not a good idea in general; we take that approach way too often at AMD 
>> and are then surprised that everything works in QA but fails in 
>> production.
>>
>> But Daniel already noted in his reply that waiting for a fence while 
>> holding the SRCU is expected to work.
>>
>> So let's stick with the approach of high level locking for hotplug.
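
Just for completeness, whether we scatter it over the back-ends or keep it
at a higher level, the drm_dev_enter()/drm_dev_exit() pattern looks roughly
like the sketch below; the helper is made up, only the two DRM calls and
adev_to_drm() are real:

    /* Hypothetical helper touching registers.  While we are inside the
     * enter/exit pair we hold the unplug SRCU read side, so
     * drm_dev_unplug() (and the MMIO unmap that follows it) waits for
     * us to finish.
     */
    static int amdgpu_do_mmio_work(struct amdgpu_device *adev)
    {
        int idx;

        if (!drm_dev_enter(adev_to_drm(adev), &idx))
            return -ENODEV; /* device already unplugged, skip the HW */

        /* ... MMIO accesses, fence waits, etc. ... */

        drm_dev_exit(idx);
        return 0;
    }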
>
>
> To my understanding this is true for the other devices, not the one being 
> extracted; for that one you still need to do all the HW fence signalling 
> dance, because the HW is gone and we block any TDRs (which won't help 
> anyway).
>
> Andrey
>
>
>>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> BTW: Could it be that the device SRCU protects more than 
>>>>>>>>>>>> one device and we deadlock because of this?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I haven't actually experienced any deadlock so far but, yes, 
>>>>>>>>>>> drm_unplug_srcu is defined as static in drm_drv.c, so in the 
>>>>>>>>>>> presence of multiple devices from the same or different 
>>>>>>>>>>> drivers we are in fact dependent on all their critical 
>>>>>>>>>>> sections, I guess.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we 
>>>>>>>>>> actually need to sync that up with Daniel and the rest of the 
>>>>>>>>>> i915 guys.
>>>>>>>>>>
>>>>>>>>>> IIRC we could actually have an amdgpu device in a docking 
>>>>>>>>>> station which needs hotplug and the driver might depend on 
>>>>>>>>>> waiting for the i915 driver as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can't we propose a patch to make drm_unplug_srcu per 
>>>>>>>>> drm_device? I don't see why it has to be global and not a 
>>>>>>>>> per-device thing.
>>>>>>>>
>>>>>>>> I've been wondering the same thing for quite a while now.
>>>>>>>>
>>>>>>>> Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
>>>>>>>> is global.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     /* Past this point no more fences are submitted to the 
>>>>>>>>>>>>>>>>> HW ring and hence we can safely call force signal on 
>>>>>>>>>>>>>>>>> all that are currently there.
>>>>>>>>>>>>>>>>>      * Any subsequently created HW fences will be 
>>>>>>>>>>>>>>>>> returned signaled with an error code right away
>>>>>>>>>>>>>>>>>      */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>>>>>>>>>     Stop schedulers
>>>>>>>>>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>>>>>>>>>     hw_fini
>>>>>>>>>>>>>>>>>     unmap_mmio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side and 
>>>>>>>>>>>>>>>>>>>>>> stopping and then restarting the scheduler could 
>>>>>>>>>>>>>>>>>>>>>> work as well.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I didn't get the above, and I don't see why I need 
>>>>>>>>>>>>>>>>>>>>> to reuse the GPU reset rw_lock. I rely on the SRCU 
>>>>>>>>>>>>>>>>>>>>> unplug flag for unplug. Also, it's not clear to me why 
>>>>>>>>>>>>>>>>>>>>> we are focusing on the scheduler threads; any code 
>>>>>>>>>>>>>>>>>>>>> path that generates HW fences should be covered, so 
>>>>>>>>>>>>>>>>>>>>> any code leading to amdgpu_fence_emit needs to be 
>>>>>>>>>>>>>>>>>>>>> taken into account, such as direct IB submissions, 
>>>>>>>>>>>>>>>>>>>>> VM flushes, etc.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You need to work together with the reset lock 
>>>>>>>>>>>>>>>>>>>> anyway, cause a hotplug could run at the same time 
>>>>>>>>>>>>>>>>>>>> as a reset.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Going my way, I indeed now see that I have to take 
>>>>>>>>>>>>>>>>>>> the reset write-side lock during HW fence 
>>>>>>>>>>>>>>>>>>> signalling in order to protect against scheduler/HW 
>>>>>>>>>>>>>>>>>>> fence detachment and reattachment during scheduler 
>>>>>>>>>>>>>>>>>>> stop/restart. But if we go with your approach, then 
>>>>>>>>>>>>>>>>>>> calling drm_dev_unplug and scoping 
>>>>>>>>>>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be 
>>>>>>>>>>>>>>>>>>> enough to prevent any concurrent GPU resets during 
>>>>>>>>>>>>>>>>>>> unplug. In fact I already do it anyway - 
>>>>>>>>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1 
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>


