<div dir="ltr">Hi<div><br></div><div>Do I need to open a new bug report for this?</div><div><br></div><div>Cheers</div><div><br></div><div>Mike</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky <<a href="mailto:andrey.grodzovsky@amd.com">andrey.grodzovsky@amd.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
On 2021-08-02 1:16 a.m., Guchun Chen wrote:<br>
> In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop<br>
> scheduler in s3 test, otherwise, fence related failure will arrive<br>
> after resume. To fix this and for a better clean up, move drm_sched_fini<br>
> from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and<br>
> should never be called in hw_fini.<br>
><br>
> v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,<br>
> to keep sw_init and sw_fini paired.<br>
><br>
> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence<br>
> Suggested-by: Christian König <<a href="mailto:christian.koenig@amd.com" target="_blank">christian.koenig@amd.com</a>><br>
> Signed-off-by: Guchun Chen <<a href="mailto:guchun.chen@amd.com" target="_blank">guchun.chen@amd.com</a>><br>
> ---<br>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---<br>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++++++-----<br>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--<br>
>   3 files changed, 11 insertions(+), 10 deletions(-)<br>
><br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> index b1d2dc39e8be..9e53ff851496 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,<br>
>   <br>
>   fence_driver_init:<br>
>       /* Fence driver */<br>
> -     r = amdgpu_fence_driver_init(adev);<br>
> +     r = amdgpu_fence_driver_sw_init(adev);<br>
>       if (r) {<br>
> -             dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");<br>
> +             dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");<br>
>               amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);<br>
>               goto failed;<br>
>       }<br>
> @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)<br>
>       }<br>
>       amdgpu_fence_driver_hw_init(adev);<br>
>   <br>
> -<br>
>       r = amdgpu_device_ip_late_init(adev);<br>
>       if (r)<br>
>               return r;<br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c<br>
> index 49c5c7331c53..7495911516c2 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c<br>
> @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,<br>
>   }<br>
>   <br>
>   /**<br>
> - * amdgpu_fence_driver_init - init the fence driver<br>
> + * amdgpu_fence_driver_sw_init - init the fence driver<br>
>    * for all possible rings.<br>
>    *<br>
>    * @adev: amdgpu device pointer<br>
> @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,<br>
>    * amdgpu_fence_driver_start_ring().<br>
>    * Returns 0 for success.<br>
>    */<br>
> -int amdgpu_fence_driver_init(struct amdgpu_device *adev)<br>
> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)<br>
>   {<br>
>       return 0;<br>
>   }<br>
>   <br>
>   /**<br>
> - * amdgpu_fence_driver_fini - tear down the fence driver<br>
> + * amdgpu_fence_driver_hw_fini - tear down the fence driver<br>
>    * for all possible rings.<br>
>    *<br>
>    * @adev: amdgpu device pointer<br>
> @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)<br>
>   <br>
>               if (!ring || !ring->fence_drv.initialized)<br>
>                       continue;<br>
> -             if (!ring->no_scheduler)<br>
> -                     drm_sched_fini(&ring->sched);<br>
> +<br>
>               /* You can't wait for HW to signal if it's gone */<br>
>               if (!drm_dev_is_unplugged(&adev->ddev))<br>
>                       r = amdgpu_fence_wait_empty(ring);<br>
<br>
<br>
Sorry for the late notice, I missed this patch. Moving drm_sched_fini<br>
past amdgpu_fence_wait_empty creates a race: even after you have waited<br>
for all fences on the ring to signal, the sw scheduler will keep submitting<br>
new jobs to the ring, so the ring won't stay empty.<br>
<br>
For hot device removal we also want to prevent any access to HW past PCI removal,<br>
so that no MMIO accesses land inside the physical MMIO range that no longer<br>
belongs to this device after its removal by the PCI core. Stopping all the<br>
schedulers prevents any MMIO accesses during job submission, and that is why<br>
drm_sched_fini was done as part of amdgpu_fence_driver_hw_fini<br>
and not amdgpu_fence_driver_sw_fini.<br>
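<br>
The ordering concern can be illustrated with a toy, single-threaded model. This is only a sketch and not amdgpu code: struct toy_ring and every toy_* function are hypothetical stand-ins, with toy_sched_stop playing the role drm_sched_fini has in this discussion and toy_wait_empty the role of amdgpu_fence_wait_empty.<br>

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in amdgpu or drm_sched. */
struct toy_ring {
	bool sched_running;  /* software scheduler may still submit jobs */
	int  queued_jobs;    /* jobs waiting in the software scheduler */
	int  pending_fences; /* fences emitted to the ring, not yet signaled */
};

/* One scheduler tick: push one queued job to the ring if still running. */
static void toy_sched_tick(struct toy_ring *r)
{
	if (r->sched_running && r->queued_jobs > 0) {
		r->queued_jobs--;
		r->pending_fences++;
	}
}

/* Stand-in for stopping the scheduler. */
static void toy_sched_stop(struct toy_ring *r)
{
	r->sched_running = false;
}

/* Stand-in for waiting until the ring is empty: drains what is pending now. */
static void toy_wait_empty(struct toy_ring *r)
{
	assert(r->pending_fences >= 0);
	r->pending_fences = 0;
}

/* Stop first, then wait: a late scheduler tick cannot refill the ring. */
static int toy_fini_stop_then_wait(struct toy_ring r)
{
	toy_sched_stop(&r);
	toy_wait_empty(&r);
	toy_sched_tick(&r);
	return r.pending_fences;
}

/* Wait first, stop later: the scheduler can submit after the wait. */
static int toy_fini_wait_then_stop(struct toy_ring r)
{
	toy_wait_empty(&r);
	toy_sched_tick(&r);
	toy_sched_stop(&r);
	return r.pending_fences;
}
```

In this model, a ring that starts with queued jobs ends with no pending fences only in the stop-then-wait order. In the wait-then-stop order, one tick after the wait leaves a new fence pending.<br>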
<br>
Andrey<br>
<br>
> @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)<br>
>               if (!ring || !ring->fence_drv.initialized)<br>
>                       continue;<br>
>   <br>
> +             if (!ring->no_scheduler)<br>
> +                     drm_sched_fini(&ring->sched);<br>
> +<br>
>               for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)<br>
>                       dma_fence_put(ring->fence_drv.fences[j]);<br>
>               kfree(ring->fence_drv.fences);<br>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h<br>
> index 27adffa7658d..9c11ced4312c 100644<br>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h<br>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h<br>
> @@ -106,7 +106,6 @@ struct amdgpu_fence_driver {<br>
>       struct dma_fence                **fences;<br>
>   };<br>
>   <br>
> -int amdgpu_fence_driver_init(struct amdgpu_device *adev);<br>
>   void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);<br>
>   <br>
>   int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,<br>
> @@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,<br>
>   int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,<br>
>                                  struct amdgpu_irq_src *irq_src,<br>
>                                  unsigned irq_type);<br>
> +void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);<br>
>   void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);<br>
> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev);<br>
>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev);<br>
> -void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);<br>
>   int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **fence,<br>
>                     unsigned flags);<br>
>   int amdgpu_fence_emit_polling(struct amdgpu_ring *ring, uint32_t *s,<br>
</blockquote></div>