[PATCH] drm/amdgpu/sdma: don't actually disable any SDMA rings via debugfs

Alex Deucher alexdeucher at gmail.com
Wed Jul 2 13:03:15 UTC 2025


On Tue, Jul 1, 2025 at 10:08 PM Zhang, Jesse(Jie) <Jesse.Zhang at amd.com> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alex,
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Alex Deucher
> Sent: Tuesday, July 1, 2025 11:26 PM
> To: amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> Subject: [PATCH] drm/amdgpu/sdma: don't actually disable any SDMA rings via debugfs
>
> We can disable various queues via debugfs for IGT testing, but in doing so, we race with the kernel for VM updates or buffer moves.
>
> Fixes: d2e3961ae371 ("drm/amdgpu: add amdgpu_sdma_sched_mask debugfs")
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 25 ++++--------------------
>  1 file changed, 4 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index 8b8a04138711c..4f98d4920f5cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -350,9 +350,8 @@ int amdgpu_sdma_ras_sw_init(struct amdgpu_device *adev)
>  static int amdgpu_debugfs_sdma_sched_mask_set(void *data, u64 val)
>  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)data;
> -       u64 i, num_ring;
> +       u64 num_ring;
>         u64 mask = 0;
> -       struct amdgpu_ring *ring, *page = NULL;
>
>         if (!adev)
>                 return -ENODEV;
> @@ -372,25 +371,9 @@ static int amdgpu_debugfs_sdma_sched_mask_set(void *data, u64 val)
>
>         if ((val & mask) == 0)
>                 return -EINVAL;
> -
> -       for (i = 0; i < adev->sdma.num_instances; ++i) {
> -               ring = &adev->sdma.instance[i].ring;
> -               if (adev->sdma.has_page_queue)
> -                       page = &adev->sdma.instance[i].page;
> -               if (val & BIT_ULL(i * num_ring))
> -                       ring->sched.ready = true;
> -               else
> -                       ring->sched.ready = false;
>
>
> Is it possible to change the write to ring->sched.ready to use WRITE_ONCE() or atomic_set() to avoid the race?
> And we could check val to avoid disabling all SDMA queues, e.g.:
>     u64 current_mask;
>     int ret;
>
>     /* Get current valid mask (reuses _get logic) */
>     ret = amdgpu_debugfs_sdma_sched_mask_get(data, &current_mask);
>     if (ret)
>         return ret;
>
>     /* Reject invalid masks */
>     if (val & ~current_mask)
>         return -EINVAL;
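For reference, the check sketched above reduces to plain bitmask arithmetic: one bit per ring, two bits per SDMA instance when a page queue exists. A minimal userspace illustration of that logic (the struct and helper names here are hypothetical stand-ins, not the driver's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Hypothetical stand-in for the relevant adev->sdma fields. */
struct sdma_info {
	unsigned int num_instances;
	bool has_page_queue;
};

/* Valid-bit mask as built in amdgpu_debugfs_sdma_sched_mask_set():
 * one bit per ring, two bits per instance with a page queue. */
static uint64_t sdma_valid_mask(const struct sdma_info *sdma)
{
	unsigned int num_ring = sdma->has_page_queue ? 2 : 1;
	uint64_t mask = 0;
	unsigned int i;

	for (i = 0; i < sdma->num_instances * num_ring; ++i)
		mask |= BIT_ULL(i);
	return mask;
}

/* The proposed checks: reject bits outside the valid mask, and
 * reject a value that would leave every queue disabled. */
static int sdma_check_sched_mask(const struct sdma_info *sdma, uint64_t val)
{
	uint64_t mask = sdma_valid_mask(sdma);

	if (val & ~mask)
		return -1;	/* would be -EINVAL: unknown ring bits */
	if ((val & mask) == 0)
		return -1;	/* would be -EINVAL: all rings disabled */
	return 0;
}
```

With two instances and page queues, the valid mask is 0xf; a value like 0x10 (an unknown bit) or 0x0 (everything off) is rejected, while 0x5 (both main rings, no page queues) passes.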

There are two things we need to handle.
1. The ring used for BO moves and clears:
adev->mman.buffer_funcs_ring
This would need to be changed to a different SDMA ring if the one
currently assigned is disabled, or we'd need to fall back to doing
copies and clears with the CPU, but that won't work without large BARs.
2. The VM scheduling entities:
vm->immediate
vm->delayed
We'd need to adjust adev->vm_manager.vm_pte_scheds and
adev->vm_manager.vm_pte_num_scheds to reflect what's currently
disabled and then update the drm sched entity.
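For point 1, the re-selection amounts to picking some instance whose main ring is still enabled in the mask, or bailing out to the CPU path when none is. A minimal userspace sketch of that selection (hypothetical helper, not driver code; bit layout as in the debugfs handler):

```c
#include <stdint.h>

/* Pick the lowest-numbered SDMA instance whose main-ring bit
 * (bit i * num_ring) is still set in the schedule mask; return -1
 * when every main ring is disabled and the driver would have to
 * fall back to CPU copies, which requires large BARs. */
static int pick_buffer_funcs_instance(uint64_t sched_mask,
				      unsigned int num_instances,
				      unsigned int num_ring)
{
	unsigned int i;

	for (i = 0; i < num_instances; ++i)
		if (sched_mask & (1ULL << (i * num_ring)))
			return (int)i;
	return -1;
}
```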

Alex

> -
> -               if (page) {
> -                       if (val & BIT_ULL(i * num_ring + 1))
> -                               page->sched.ready = true;
> -                       else
> -                               page->sched.ready = false;
> -               }
> -       }
> -       /* publish sched.ready flag update effective immediately across smp */
> -       smp_rmb();
> +       /* Just return success here. We can't disable any rings otherwise
> +        * we race with vm updates or buffer ops.
> +        */
>         return 0;
>  }
>
> --
> 2.50.0
>

