[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König christian.koenig at amd.com
Tue Apr 13 09:14:56 UTC 2021


Hi Dennis,

yeah, that just has the same downside of a lot of additional overhead
as the is_signaled callback.

Bouncing cache lines on the CPU isn't funny at all.

Christian.

Am 13.04.21 um 11:13 schrieb Li, Dennis:
>
> Hi, Christian and Andrey,
>        Maybe we could try to implement the "wait" callback function of dma_fence_ops: when a GPU reset or unplug happens, make this callback return -ENODEV to notify the caller that the device is lost.
>
> 	 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
> 	 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
> 	 * timed out. Can also return other error values on custom implementations,
> 	 * which should be treated as if the fence is signaled. For example a hardware
> 	 * lockup could be reported like that.
> 	 *
> 	 * This callback is optional.
> 	 */
> 	signed long (*wait)(struct dma_fence *fence,
> 			    bool intr, signed long timeout);
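>
> 	A minimal sketch of what such a callback could look like (struct
> 	my_fence and its dev_lost flag are hypothetical illustrations, not
> 	existing amdgpu code):
>
> 	/* Sketch only: dev_lost is an imaginary per-device flag that the
> 	 * driver would set before starting a reset or handling an unplug. */
> 	static signed long my_fence_wait(struct dma_fence *fence,
> 					 bool intr, signed long timeout)
> 	{
> 		struct my_fence *f = container_of(fence, struct my_fence, base);
>
> 		if (READ_ONCE(f->dev_lost))
> 			return -ENODEV; /* tell the caller the device is gone */
>
> 		/* otherwise fall back to the default wait implementation */
> 		return dma_fence_default_wait(fence, intr, timeout);
> 	}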
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Tuesday, April 13, 2021 3:10 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Daniel Vetter <daniel at ffwll.ch>
> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>> On 2021-04-12 3:18 p.m., Christian König wrote:
>>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>>> [SNIP]
>>>>>> So what's the right approach? How do we guarantee that when running
>>>>>> amdgpu_fence_driver_force_completion we signal all the HW
>>>>>> fences without racing against more fences being inserted into that
>>>>>> array?
>>>>>>
>>>>> Well I would still say the best approach would be to insert this
>>>>> between the front end and the backend and not rely on signaling
>>>>> fences while holding the device srcu.
>>>>
>>>> My question is, even now, when we run
>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
>>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>>>> what prevents a race with another fence being emitted and inserted
>>>> into the fence array at the same time? Looks like nothing.
>>>>
>>> Each ring can only be used by one thread at the same time, this
>>> includes emitting fences as well as other stuff.
>>>
>>> During GPU reset we make sure nobody writes to the rings by stopping
>>> the scheduler and taking the GPU reset lock (so that nobody else can
>>> start the scheduler again).
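>>>
>>> Roughly this pattern (illustrative only; reset_lock here is a stand-in
>>> for whatever lock amdgpu actually uses to serialize resets):
>>>
>>> 	int i;
>>>
>>> 	/* Nobody else can start a reset or restart the schedulers. */
>>> 	down_write(&adev->reset_lock);
>>>
>>> 	/* Stop the schedulers so no new jobs reach the rings. */
>>> 	for (i = 0; i < adev->num_rings; ++i)
>>> 		drm_sched_stop(&adev->rings[i]->sched, NULL);
>>>
>>> 	/* ... reset the GPU / force-signal the HW fences ... */
>>>
>>> 	for (i = 0; i < adev->num_rings; ++i)
>>> 		drm_sched_start(&adev->rings[i]->sched, true);
>>>
>>> 	up_write(&adev->reset_lock);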
>>
>> What about direct submissions that don't go through the scheduler -
>> amdgpu_job_submit_direct? I don't see how those are protected.
> Those only happen during startup and GPU reset.
>
>>>>> BTW: Could it be that the device SRCU protects more than one device
>>>>> and we deadlock because of this?
>>>>
>>>> I haven't actually experienced any deadlock until now, but yes,
>>>> drm_unplug_srcu is defined as static in drm_drv.c, and so in the
>>>> presence of multiple devices from the same or different drivers we
>>>> are in fact dependent on all their critical sections, I guess.
>>>>
>>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
>>> need to sync that up with Daniel and the rest of the i915 guys.
>>>
>>> IIRC we could actually have an amdgpu device in a docking station
>>> which needs hotplug and the driver might depend on waiting for the
>>> i915 driver as well.
>>
>> Can't we propose a patch to make drm_unplug_srcu per drm_device? I
>> don't see why it has to be global rather than a per-device thing.
> I've really been wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.
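>
> The proposal would look roughly like this (a sketch only, not an actual
> patch - it just moves the file-static SRCU from drm_drv.c into struct
> drm_device so every device gets its own grace periods):
>
> 	/* struct drm_device would gain its own SRCU instance ... */
> 	struct srcu_struct unplug_srcu;
>
> 	/* ... and drm_dev_enter()/drm_dev_exit() would use it instead of
> 	 * the global drm_unplug_srcu. */
> 	bool drm_dev_enter(struct drm_device *dev, int *idx)
> 	{
> 		*idx = srcu_read_lock(&dev->unplug_srcu);
>
> 		if (dev->unplugged) {
> 			srcu_read_unlock(&dev->unplug_srcu, *idx);
> 			return false;
> 		}
>
> 		return true;
> 	}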
>
> Regards,
> Christian.
>
>> Andrey
>>
>>
>>> Christian.
>>>
>>>> Andrey
>>>>
>>>>
>>>>> Christian.
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>      /* Past this point no more fences are submitted to the HW ring,
>>>>>>>>>> so we can safely force-signal all that are
>>>>>>>>>> currently there.
>>>>>>>>>>       * Any subsequently created HW fences will be returned
>>>>>>>>>> signaled with an error code right away.
>>>>>>>>>>       */
>>>>>>>>>>
>>>>>>>>>>      for_each_ring(adev)
>>>>>>>>>>          amdgpu_fence_process(ring)
>>>>>>>>>>
>>>>>>>>>>      drm_dev_unplug(dev);
>>>>>>>>>>      Stop schedulers
>>>>>>>>>>      cancel_sync(all timers and queued works);
>>>>>>>>>>      hw_fini
>>>>>>>>>>      unmap_mmio
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping
>>>>>>>>>>>>>>> and then restarting the scheduler could work as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I didn't get the above and I don't see why I need to reuse
>>>>>>>>>>>>>> the GPU reset rw_lock. I rely on the SRCU unplug flag for
>>>>>>>>>>>>>> unplug. Also, it is not clear to me why we are focusing on
>>>>>>>>>>>>>> the scheduler threads; any code path that generates HW
>>>>>>>>>>>>>> fences should be covered, so any code leading to
>>>>>>>>>>>>>> amdgpu_fence_emit needs to be taken into account, such as
>>>>>>>>>>>>>> direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>> You need to work together with the reset lock anyway, cause
>>>>>>>>>>>>> a hotplug could run at the same time as a reset.
>>>>>>>>>>>>
>>>>>>>>>>>> If I go my way, I indeed now see that I have to take the
>>>>>>>>>>>> reset write side lock during HW fence signalling in order
>>>>>>>>>>>> to protect against scheduler/HW fence detachment and
>>>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
>>>>>>>>>>>> with your approach, then calling drm_dev_unplug and scoping
>>>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough
>>>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact
>>>>>>>>>>>> I already do it anyway -
>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
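>>>>>>>>>>>>
>>>>>>>>>>>> The scoping itself looks roughly like this (a sketch only; the
>>>>>>>>>>>> commit linked above is the actual patch and differs in detail):
>>>>>>>>>>>>
>>>>>>>>>>>> 	static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>>>>>>> 	{
>>>>>>>>>>>> 		struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>>>>>>>>>> 		int idx;
>>>>>>>>>>>>
>>>>>>>>>>>> 		/* Bail out if the device is already gone. */
>>>>>>>>>>>> 		if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
>>>>>>>>>>>> 			return DRM_GPU_SCHED_STAT_ENODEV;
>>>>>>>>>>>>
>>>>>>>>>>>> 		/* ... normal timeout handling / GPU reset ... */
>>>>>>>>>>>>
>>>>>>>>>>>> 		drm_dev_exit(idx);
>>>>>>>>>>>> 		return DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>>>>> 	}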
>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>


