[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Tue Apr 13 20:07:38 UTC 2021

On Tue, Apr 13, 2021 at 9:10 AM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> On 12.04.21 at 22:01, Andrey Grodzovsky wrote:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> On 12.04.21 at 21:12, Andrey Grodzovsky wrote:
> >>> [SNIP]
> >>>>>
> >>>>> So what's the right approach? How do we guarantee that when running
> >>>>> amdgpu_fence_driver_force_completion we signal all the HW
> >>>>> fences and do not race against more fences being inserted into that
> >>>>> array?
> >>>>>
> >>>>
> >>>> Well I would still say the best approach would be to insert this
> >>>> between the front end and the backend and not rely on signaling
> >>>> fences while holding the device srcu.
> >>>
> >>>
> >>> My question is: even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what prevents a race with another fence being emitted and inserted
> >>> into the fence array at the same time? Looks like nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at the same time, this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by stopping
> >> the scheduler and taking the GPU reset lock (so that nobody else can
> >> start the scheduler again).
> >
> >
> > What about direct submissions that bypass the scheduler
> > (amdgpu_job_submit_direct)? I don't see how those are protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> >>>>
> >>>> BTW: Could it be that the device SRCU protects more than one device
> >>>> and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock so far but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c, so in the
> >>> presence of multiple devices from the same or different drivers we
> >>> in fact depend on all their critical sections, I guess.
> >>>
> >>
> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> >> need to sync that up with Daniel and the rest of the i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug and the driver might depend on waiting for the
> >> i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device? I
> > don't see why it has to be global rather than a per-device thing.
>
> I've been wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

SRCU isn't exactly the cheapest thing, but aside from that we could
make it per-device. I don't see much point, though: if you do end up
stuck on an ioctl, that can happen with anything really.
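
To make the global versus per-device trade-off concrete, here is a
minimal single-threaded userspace sketch in C. It is only a toy model:
it is not the kernel SRCU API and not the drm_drv.c code, and every name
in it (unplug_domain, toy_device, toy_dev_enter/exit/unplug) is
hypothetical. It just counts how many open critical sections an unplug
would have to wait for when devices share one domain versus owning one
each.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an SRCU domain: only counts open sections. */
struct unplug_domain {
    int active_readers; /* sections currently between enter and exit */
};

/* Hypothetical device. Pointing several devices at the same domain models
 * a global drm_unplug_srcu; one domain per device models the proposal. */
struct toy_device {
    const char *name;
    bool unplugged;
    struct unplug_domain *domain;
};

/* Modeled loosely on drm_dev_enter(): refuse entry once unplugged. */
static bool toy_dev_enter(struct toy_device *dev)
{
    if (dev->unplugged)
        return false;
    dev->domain->active_readers++;
    return true;
}

/* Modeled loosely on drm_dev_exit(): close one critical section. */
static void toy_dev_exit(struct toy_device *dev)
{
    assert(dev->domain->active_readers > 0);
    dev->domain->active_readers--;
}

/* Mark the device unplugged and report how many open sections in its
 * domain would have to drain first. A real implementation would block
 * here instead of returning a count. */
static int toy_dev_unplug(struct toy_device *dev)
{
    dev->unplugged = true;
    return dev->domain->active_readers;
}
```

With a shared domain, an open section on one device shows up in the
count returned when unplugging the other; with separate domains it does
not. In both cases the reader eventually has to leave its section, which
is why the per-device split only narrows who can delay an unplug.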

Also note that dma_fence_waits are supposed to be time bound, so you
shouldn't end up waiting on them forever. It should all get sorted out
in due time with TDR I hope (e.g. if i915 is stuck on a fence because
you're unlucky).
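
As a rough illustration of what "time bound" means here, a small
userspace sketch in C follows. It is a toy model, not the dma_fence
implementation: toy_fence, toy_fence_wait_timeout and
toy_timeout_recovery are hypothetical names, time is a plain tick
counter, and the -ETIMEDOUT error value is an arbitrary choice for the
example. The return convention (0 on timeout, remaining ticks otherwise)
is borrowed from dma_fence_wait_timeout.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical fence: the "hardware" signals it at a fixed tick, or never. */
struct toy_fence {
    long signal_tick; /* tick at which HW would signal; negative = hung */
    bool signaled;
    int error;        /* 0, or a negative errno set by recovery */
};

/* Bounded wait: returns the remaining ticks (at least 1) if the fence
 * signals before the deadline, or 0 if the timeout expires first. */
static long toy_fence_wait_timeout(struct toy_fence *f, long timeout)
{
    for (long now = 0; now < timeout; now++) {
        if (f->signal_tick >= 0 && now >= f->signal_tick) {
            f->signaled = true;
            return timeout - now;
        }
    }
    return 0;
}

/* Stand-in for timeout detection and recovery: force-complete a fence
 * that never signaled and record an error, so waiters can make progress. */
static void toy_timeout_recovery(struct toy_fence *f)
{
    if (!f->signaled) {
        f->error = -ETIMEDOUT;
        f->signaled = true;
    }
}
```

A waiter that gets 0 back knows the fence is overdue and can hand it to
recovery instead of blocking forever, which is the property that keeps
an unplug from hanging behind another driver's stuck fence.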
-Daniel

>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> >>
> >> Christian.
> >>
> >>> Andrey
> >>>
> >>>
> >>>>
> >>>> Christian.
> >>>>
> >>>>> Andrey
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> Andrey
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>>>     /* Past this point no more fences are submitted to the HW
> >>>>>>>>>      * ring, hence we can safely force-signal all that are
> >>>>>>>>>      * currently there.
> >>>>>>>>>      * Any subsequently created HW fences will be returned
> >>>>>>>>>      * signaled with an error code right away.
> >>>>>>>>>      */
> >>>>>>>>>
> >>>>>>>>>     for_each_ring(adev)
> >>>>>>>>>         amdgpu_fence_process(ring)
> >>>>>>>>>
> >>>>>>>>>     drm_dev_unplug(dev);
> >>>>>>>>>     Stop schedulers
> >>>>>>>>>     cancel_sync(all timers and queued works);
> >>>>>>>>>     hw_fini
> >>>>>>>>>     unmap_mmio
> >>>>>>>>>
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Andrey
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping
> >>>>>>>>>>>>>> and then restarting the scheduler could work as well.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Christian.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I didn't get the above and I don't see why I need to reuse
> >>>>>>>>>>>>> the GPU reset rw_lock. I rely on the SRCU unplug flag for
> >>>>>>>>>>>>> unplug. Also, it's not clear to me why we are focusing on
> >>>>>>>>>>>>> the scheduler threads; any code path that generates HW
> >>>>>>>>>>>>> fences should be covered, so any code leading to
> >>>>>>>>>>>>> amdgpu_fence_emit needs to be taken into account, such as
> >>>>>>>>>>>>> direct IB submissions, VM flushes, etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You need to work together with the reset lock anyway, because
> >>>>>>>>>>>> a hotplug could run at the same time as a reset.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> If I go my way, I now see that I indeed have to take the
> >>>>>>>>>>> reset write side lock during HW fence signalling in order
> >>>>>>>>>>> to protect against scheduler/HW fence detachment and
> >>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
> >>>>>>>>>>> with your approach, then calling drm_dev_unplug and scoping
> >>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough
> >>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact
> >>>>>>>>>>> I already do it anyway -
> >>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
> >>>>>>>>>>
> >>>>>>>>>> Yes, good point as well.
> >>>>>>>>>>
> >>>>>>>>>> Christian.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Andrey
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Christian.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Christian.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch