[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König christian.koenig at amd.com
Fri Apr 9 16:39:45 UTC 2021


On 09.04.21 at 17:42, Andrey Grodzovsky wrote:
>
> On 2021-04-09 3:01 a.m., Christian König wrote:
>> On 09.04.21 at 08:53, Christian König wrote:
>>> On 08.04.21 at 22:39, Andrey Grodzovsky wrote:
>>>> [SNIP]
>>>> But inserting drm_dev_enter/exit at the highest level in drm_ioctl
>>>> is much less effort and leaves less room for error than going
>>>> through each IOCTL and trying to identify at what point (possibly
>>>> multiple points) it is about to access HW, some of which is hidden
>>>> deep in HAL layers such as the DC layer in the display driver or
>>>> the multiple layers of the powerplay/SMU libraries. Also, we can't
>>>> limit ourselves to the back-end only, if by this you mean ASIC
>>>> specific functions which access registers. We also need to take
>>>> care of any kernel MMIO BOs (VRAM BOs) where we may access MMIO
>>>> space directly by pointer from the front end of the driver (HW
>>>> agnostic) and the TTM/DRM layers.
>>>
>>> Exactly, yes. The key point is that we need to identify such places
>>> anyway for GPU reset to work properly. So we could just piggyback
>>> hotplug on top of that work and be done.
>
>
> I see most of this was done by Denis in this patch
> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929,
> but indeed it doesn't cover the direct by-pointer accesses to MMIO, and
> it will introduce many more of those. As people write new code, new
> places to cover will keep popping up, leading to regressions and extra
> work to fix them. It would be much better if we could blanket-cover it
> at the very top, such as the root of all IOCTLs or the top-level
> function of any queued work/timer, and handle it once and for all.

And exactly that is what's not possible. At least for the reset case you
need to look into each hardware access and handle it bit by bit, and I
think that for the hotplug case we should go down that route as well.
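
In practice that means each such place gets wrapped individually, roughly
like this (just a sketch, the wrapper function here is made up):

        static int amdgpu_example_hw_write(struct amdgpu_device *adev,
                                           u32 reg, u32 val)
        {
                int idx;

                /* Bail out if the device is already unplugged, otherwise
                 * hold the drm_dev_enter() SRCU read side until the MMIO
                 * access is done.
                 */
                if (!drm_dev_enter(adev_to_drm(adev), &idx))
                        return -ENODEV;

                WREG32(reg, val);

                drm_dev_exit(idx);
                return 0;
        }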

>>>
>>>>
>>>> Our problem here is how to signal all the existing fences on the one
>>>> hand, and on the other prevent any new dma_fence waits after we have
>>>> finished signaling the existing fences. Once we have solved this,
>>>> there is no problem using drm_dev_unplug in conjunction with
>>>> drm_dev_enter/exit at the highest level of drm_ioctl to flush any
>>>> IOCTLs in flight and block any new ones.
>>>>
>>>> IMHO when we speak about signaling all fences we don't mean ALL the
>>>> currently existing dma_fence structs (they are spread all over the
>>>> place) but rather all the HW fences, because the HW is what's gone
>>>> and we can't expect those fences to ever be signaled. All the rest,
>>>> such as scheduler fences, user fences, drm_gem reservation objects
>>>> etc., are either dependent on those HW fences, and hence signaling
>>>> the HW fences will in turn signal them, or are not impacted by the
>>>> HW being gone and hence can still be waited on and will complete.
>>>> If this assumption is correct then I think we should use some flag
>>>> to prevent any new submission to HW which creates HW fences
>>>> (somewhere around amdgpu_fence_emit), then traverse all existing HW
>>>> fences (currently they are spread in a few places so maybe we need
>>>> to track them in a list) and signal them. After that it's safe to
>>>> call drm_dev_unplug and be sure synchronize_srcu won't stall because
>>>> of a dma_fence_wait. After that we can proceed to canceling work
>>>> items, stopping schedulers etc.
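
The ordering being described would then roughly be this (sketch only, the
helper and the fence list are invented here for illustration):

        /* 1) From now on refuse to create new hardware fences
         *    (invented helper, would sit around amdgpu_fence_emit()).
         */
        amdgpu_block_hw_fence_emission(adev);

        /* 2) Force-signal everything the dead hardware will never
         *    signal on its own (hw_fence_list is hypothetical).
         */
        list_for_each_entry(af, &adev->hw_fence_list, node)
                dma_fence_signal(&af->base);

        /* 3) Only now call drm_dev_unplug(); its synchronize_srcu()
         *    can no longer stall on a dma_fence_wait() inside an IOCTL.
         */
        drm_dev_unplug(adev_to_drm(adev));

        /* 4) Proceed with cancelling work items, stopping the
         *    schedulers and so on.
         */
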
>>>
>>> That is problematic as well, since you need to make sure that the
>>> scheduler is not creating a new hardware fence at the moment you try
>>> to signal all of them. It would require another SRCU or lock for this.
>
>
> What if we use a list and a flag called 'emit_allowed' under a lock,
> such that in amdgpu_fence_emit we lock the list, check the flag and, if
> it is true, add the new HW fence to the list and proceed with HW
> emission as normal, otherwise return -ENODEV. In amdgpu_pci_remove we
> take the lock, set the flag to false, and then iterate the list and
> force-signal the fences. Won't this prevent any new HW fence creation
> from then on, from any place trying to do so?
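
In amdgpu_fence_emit() that would boil down to something like this (sketch
only, the lock, flag and list names are made up):

        spin_lock(&adev->hw_fence_lock);
        if (!adev->hw_fence_emit_allowed) {
                /* Device is gone, refuse to emit a new HW fence. */
                spin_unlock(&adev->hw_fence_lock);
                return -ENODEV;
        }
        list_add_tail(&fence->node, &adev->hw_fence_list);
        spin_unlock(&adev->hw_fence_lock);

        /* ... continue with the normal emission to the ring ... */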

Way too much overhead. The fence processing is intentionally lock-free to
avoid cache line bouncing because the IRQ can move from CPU to CPU.

We need something which, at the very least, doesn't affect the processing
of fences in the interrupt handler at all.

>>
>> Alternatively, grabbing the write side of the reset lock and stopping
>> and then restarting the scheduler could work as well.
>>
>> Christian.
>
>
> I didn't get the above, and I don't see why I need to reuse the GPU
> reset rw_lock; I rely on the SRCU unplug flag for unplug. Also, it's
> not clear to me why we are focusing on the scheduler threads. Any code
> path that generates HW fences should be covered, so any code leading to
> amdgpu_fence_emit needs to be taken into account, such as direct IB
> submissions, VM flushes etc.

You need to work together with the reset lock anyway, because a hotplug
could run at the same time as a reset.
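
E.g. the hotplug path would grab the same write side the GPU reset path
uses before stopping the schedulers, roughly like this (sketch from
memory, the lock and field names may not match the current code exactly):

        /* Serialize against a GPU reset that may be in flight. */
        down_write(&adev->reset_sem);

        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (!ring || !ring->sched.thread)
                        continue;

                /* Park the scheduler so no new jobs hit the dead HW. */
                drm_sched_stop(&ring->sched, NULL);

                /* Complete the HW fences the hardware will never signal. */
                amdgpu_fence_driver_force_completion(ring);
        }

        up_write(&adev->reset_sem);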


Christian.

>
> Andrey
>
>


