[PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

Chen, Guchun Guchun.Chen at amd.com
Fri Aug 27 10:42:40 UTC 2021


[Public]

Hi Andrey and Christian,

I have just sent out a new patch to address this; I am not sure whether I understood your point correctly. Please review.

The patch stops the scheduler in fence_hw_fini and starts it again in fence_hw_init.

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com> 
Sent: Monday, August 23, 2021 10:42 PM
To: Christian König <ckoenig.leichtzumerken at gmail.com>; Chen, Guchun <Guchun.Chen at amd.com>; Alex Deucher <alexdeucher at gmail.com>; Mike Lothian <mike at fireburn.co.uk>; Koenig, Christian <Christian.Koenig at amd.com>
Cc: amd-gfx list <amd-gfx at lists.freedesktop.org>; Gao, Likun <Likun.Gao at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)


On 2021-08-23 2:50 a.m., Christian König wrote:
> Good morning, guys,
>
> Andrey has a rather valid concern here, but I think we need to 
> approach this from a higher-level view.
>
> When hw_fini is called we should make sure that the scheduler can't 
> submit any more work to the hardware, because the hw is finalized and 
> not expected to respond any more.
>
> As far as I can see the cleanest approach would be to stop the 
> scheduler in hw_fini and fully clean it up in sw_fini. That would also 
> fit quite nicely with how GPU reset is supposed to work, I think.
>
> Problem is that this is currently done outside of the fence code for 
> at least the reset case, so before we restructure that we need to 
> stick with what we have.
>
> Andrey do you think it would be any problem if we stop the scheduler 
> manually in the hot plug case as well?


As long as it's 'parked' inside HW fini - meaning the thread submitting to HW is done - I think it should cover hot unplug as well.

Andrey


>
> Thanks,
> Christian.
>
> Am 23.08.21 um 08:36 schrieb Chen, Guchun:
>> [Public]
>>
>> Hi Andrey,
>>
>> Thanks for your notice. The reason for moving drm_sched_fini to 
>> sw_fini is that it is SW behavior and part of SW shutdown, so hw_fini 
>> should not touch it. But if the race is there - the scheduler on the 
>> ring may keep submitting jobs, leaving the ring non-empty - then we 
>> may still need to call drm_sched_fini in hw_fini to stop job 
>> submission first.
>>
>> @Koenig, Christian what's your opinion?
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: Alex Deucher <alexdeucher at gmail.com>
>> Sent: Friday, August 20, 2021 2:13 AM
>> To: Mike Lothian <mike at fireburn.co.uk>
>> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, Guchun 
>> <Guchun.Chen at amd.com>; amd-gfx list <amd-gfx at lists.freedesktop.org>; 
>> Gao, Likun <Likun.Gao at amd.com>; Koenig, Christian 
>> <Christian.Koenig at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; 
>> Deucher, Alexander <Alexander.Deucher at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
>> fini in s3 test (v2)
>>
>> Please go ahead.  Thanks!
>>
>> Alex
>>
>> On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian <mike at fireburn.co.uk>
>> wrote:
>>> Hi
>>>
>>> Do I need to open a new bug report for this?
>>>
>>> Cheers
>>>
>>> Mike
>>>
>>> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky 
>>> <andrey.grodzovsky at amd.com> wrote:
>>>>
>>>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>>>>> In amdgpu_fence_driver_hw_fini, there is no need to call 
>>>>> drm_sched_fini to stop the scheduler in the s3 test; otherwise, 
>>>>> fence-related failures will show up after resume. To fix this, 
>>>>> and for a cleaner teardown, move drm_sched_fini from 
>>>>> fence_hw_fini to fence_sw_fini, as it is part of driver shutdown 
>>>>> and should never be called in hw_fini.
>>>>>
>>>>> v2: rename amdgpu_fence_driver_init to 
>>>>> amdgpu_fence_driver_sw_init, to keep sw_init and sw_fini paired.
>>>>>
>>>>> Fixes: cd87a6dcf6af ("drm/amdgpu: adjust fence driver enable 
>>>>> sequence")
>>>>> Suggested-by: Christian König <christian.koenig at amd.com>
>>>>> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++++++-----
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>>>>>    3 files changed, 11 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index b1d2dc39e8be..9e53ff851496 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>>
>>>>>    fence_driver_init:
>>>>>        /* Fence driver */
>>>>> -     r = amdgpu_fence_driver_init(adev);
>>>>> +     r = amdgpu_fence_driver_sw_init(adev);
>>>>>        if (r) {
>>>>> -             dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
>>>>> +             dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
>>>>>                amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
>>>>>                goto failed;
>>>>>        }
>>>>> @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>>>>        }
>>>>>        amdgpu_fence_driver_hw_init(adev);
>>>>>
>>>>> -
>>>>>        r = amdgpu_device_ip_late_init(adev);
>>>>>        if (r)
>>>>>                return r;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 49c5c7331c53..7495911516c2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * amdgpu_fence_driver_init - init the fence driver
>>>>> + * amdgpu_fence_driver_sw_init - init the fence driver
>>>>>     * for all possible rings.
>>>>>     *
>>>>>     * @adev: amdgpu device pointer
>>>>> @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>     * amdgpu_fence_driver_start_ring().
>>>>>     * Returns 0 for success.
>>>>>     */
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>>>>>    {
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * amdgpu_fence_driver_fini - tear down the fence driver
>>>>> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>>>>>     * for all possible rings.
>>>>>     *
>>>>>     * @adev: amdgpu device pointer
>>>>> @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>>>>>
>>>>>                if (!ring || !ring->fence_drv.initialized)
>>>>>                        continue;
>>>>> -             if (!ring->no_scheduler)
>>>>> -                     drm_sched_fini(&ring->sched);
>>>>> +
>>>>>                /* You can't wait for HW to signal if it's gone */
>>>>>                if (!drm_dev_is_unplugged(&adev->ddev))
>>>>>                        r = amdgpu_fence_wait_empty(ring);
>>>>
>>>> Sorry for the late notice, I missed this patch. By moving 
>>>> drm_sched_fini past amdgpu_fence_wait_empty, a race is created: 
>>>> even after you have waited for all fences on the ring to signal, 
>>>> the SW scheduler will keep submitting new jobs on the ring, so the 
>>>> ring won't stay empty.
>>>>
>>>> For hot device removal we also want to prevent any access to HW 
>>>> past PCI removal, in order not to do any MMIO accesses inside the 
>>>> physical MMIO range that no longer belongs to this device after 
>>>> its removal by the PCI core. Stopping all the schedulers prevents 
>>>> any MMIO accesses done during job submission, and that is why 
>>>> drm_sched_fini was done as part of amdgpu_fence_driver_hw_fini and 
>>>> not amdgpu_fence_driver_sw_fini.
>>>>
>>>> Andrey
>>>>
>>>>> @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>>                if (!ring || !ring->fence_drv.initialized)
>>>>>                        continue;
>>>>>
>>>>> +             if (!ring->no_scheduler)
>>>>> +                     drm_sched_fini(&ring->sched);
>>>>> +
>>>>>                for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>>>>                        dma_fence_put(ring->fence_drv.fences[j]);
>>>>>                kfree(ring->fence_drv.fences);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 27adffa7658d..9c11ced4312c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -106,7 +106,6 @@ struct amdgpu_fence_driver {
>>>>>        struct dma_fence                **fences;
>>>>>    };
>>>>>
>>>>> -int amdgpu_fence_driver_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
>>>>>
>>>>>    int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>> @@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>    int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
>>>>>                                   struct amdgpu_irq_src *irq_src,
>>>>>                                   unsigned irq_type);
>>>>> +void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
>>>>> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev);
>>>>>    void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev); 
>>>>> -void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
>>>>>    int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **fence,
>>>>>                      unsigned flags);
>>>>>    int amdgpu_fence_emit_polling(struct amdgpu_ring *ring, uint32_t *s,
>

