[PATCH] drm/amdgpu/sdma: don't actually disable any SDMA rings via debugfs
Alex Deucher
alexdeucher at gmail.com
Wed Jul 2 13:12:36 UTC 2025
On Wed, Jul 2, 2025 at 9:03 AM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Tue, Jul 1, 2025 at 10:08 PM Zhang, Jesse(Jie) <Jesse.Zhang at amd.com> wrote:
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Alex,
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Alex Deucher
> > Sent: Tuesday, July 1, 2025 11:26 PM
> > To: amd-gfx at lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> > Subject: [PATCH] drm/amdgpu/sdma: don't actually disable any SDMA rings via debugfs
> >
> > We can disable various queues via debugfs for IGT testing, but in doing so, we race with the kernel for VM updates or buffer moves.
> >
> > Fixes: d2e3961ae371 ("drm/amdgpu: add amdgpu_sdma_sched_mask debugfs")
> > Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 25 ++++--------------------
> > 1 file changed, 4 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> > index 8b8a04138711c..4f98d4920f5cf 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> > @@ -350,9 +350,8 @@ int amdgpu_sdma_ras_sw_init(struct amdgpu_device *adev) static int amdgpu_debugfs_sdma_sched_mask_set(void *data, u64 val) {
> > struct amdgpu_device *adev = (struct amdgpu_device *)data;
> > - u64 i, num_ring;
> > + u64 num_ring;
> > u64 mask = 0;
> > - struct amdgpu_ring *ring, *page = NULL;
> >
> > if (!adev)
> > return -ENODEV;
> > @@ -372,25 +371,9 @@ static int amdgpu_debugfs_sdma_sched_mask_set(void *data, u64 val)
> >
> > if ((val & mask) == 0)
> > return -EINVAL;
> > -
> > - for (i = 0; i < adev->sdma.num_instances; ++i) {
> > - ring = &adev->sdma.instance[i].ring;
> > - if (adev->sdma.has_page_queue)
> > - page = &adev->sdma.instance[i].page;
> > - if (val & BIT_ULL(i * num_ring))
> > - ring->sched.ready = true;
> > - else
> > - ring->sched.ready = false;
> >
> >
> > Is it possible to change the write ring->sched.ready via WRITE_ONCE or atomic_set to avoid the race?
> > And check val to avoid disabling all sdma queues.
> > /* Get current valid mask (reuses _get logic) */
> > ret = amdgpu_debugfs_sdma_sched_mask_get(data, current_mask);
> > if (ret)
> > return ret;
> >
> > /* Reject invalid masks */
> > if (val & ~current_mask)
> > return -EINVAL;
>
> There are two things we need to handle.
> 1. The ring used for BO moves and clears:
> adev->mman.buffer_funcs_ring
> This would need to be changed to a different SDMA ring if the once
> currently assigned is disabled or we'd need to fall back to do copies
> and clears with the CPU, but that won't work without large BARs.
> 2. The VM scheduling entities:
> vm->immediate
> vm->delayed
> We'd need to adjust adev->vm_manager.vm_pte_scheds and
> adev->vm_manager.vm_pte_num_scheds to reflect what's currently
> disabled and then update the drm sched entity.
Here's the segfault I'm seeing with the latter:
[ 193.939200] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 193.939226] #PF: supervisor read access in kernel mode
[ 193.939238] #PF: error_code(0x0000) - not-present page
[ 193.939250] PGD 10836d8067 P4D 10836d8067 PUD 10d5e76067 PMD 0
[ 193.939275] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 193.939291] CPU: 15 UID: 0 PID: 4678 Comm: amd_deadlock Tainted: G
E 6.14.0+ #1976
[ 193.939312] Tainted: [E]=UNSIGNED_MODULE
[ 193.939322] Hardware name: System manufacturer System Product
Name/ROG STRIX X399-E GAMING, BIOS 1002 02/15/2019
[ 193.939339] RIP: 0010:drm_sched_job_arm+0x1f/0x50 [gpu_sched]
[ 193.939366] Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00
55 53 48 8b 6f 20 48 85 ed 74 3d 48 89 fb 48 89 ef e8 a5 36 00 00 48
8b 45 18 <48> 8b 10 48 89 53 10 8b 45 2c 89 43 28 b8 01 00 00 00 f0 48
0f c1
[ 193.939395] RSP: 0018:ffffa70898ecbaa8 EFLAGS: 00010206
[ 193.939410] RAX: 0000000000000000 RBX: ffff8e382cc3e400 RCX: 0000000000000000
[ 193.939425] RDX: 0000000000000001 RSI: ffff8e3808c16ed0 RDI: 00000000ffffffff
[ 193.939440] RBP: ffff8e384213b350 R08: ffff8e3811ab0968 R09: 0000000000000000
[ 193.939454] R10: ffff8e3808c16ed0 R11: 0000000000000003 R12: ffffa70898ecbc20
[ 193.939468] R13: ffff8e382cc3e400 R14: 0000000000000000 R15: 0000000000000000
[ 193.939484] FS: 00007fd1a992bac0(0000) GS:ffff8e47be580000(0000)
knlGS:0000000000000000
[ 193.939502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 193.939515] CR2: 0000000000000000 CR3: 00000010ded50000 CR4: 00000000003506f0
[ 193.939530] Call Trace:
[ 193.939541] <TASK>
[ 193.939553] ? __die_body.cold+0x19/0x27
[ 193.939571] ? page_fault_oops+0x116/0x280
[ 193.939587] ? srso_return_thunk+0x5/0x5f
[ 193.939603] ? srso_return_thunk+0x5/0x5f
[ 193.939617] ? do_user_addr_fault+0x63/0x620
[ 193.939630] ? irq_work_queue+0xa/0x50
[ 193.939649] ? exc_page_fault+0x7a/0x190
[ 193.939665] ? asm_exc_page_fault+0x22/0x30
[ 193.939688] ? drm_sched_job_arm+0x1f/0x50 [gpu_sched]
[ 193.939711] ? drm_sched_job_arm+0x1b/0x50 [gpu_sched]
[ 193.939732] amdgpu_job_submit+0x15/0xe0 [amdgpu]
[ 193.940502] amdgpu_vm_sdma_commit+0x76/0x210 [amdgpu]
[ 193.941144] amdgpu_vm_update_range+0x423/0x830 [amdgpu]
[ 193.941631] amdgpu_vm_clear_freed+0x108/0x270 [amdgpu]
[ 193.942063] amdgpu_gem_va_ioctl+0x4be/0x800 [amdgpu]
[ 193.942475] ? up_read+0x37/0x70
[ 193.942492] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 193.942904] drm_ioctl_kernel+0x82/0xe0 [drm]
[ 193.942974] drm_ioctl+0x25c/0x4f0 [drm]
[ 193.943038] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 193.943457] amdgpu_drm_ioctl+0x47/0x80 [amdgpu]
[ 193.943853] __x64_sys_ioctl+0x93/0xc0
[ 193.943867] do_syscall_64+0x62/0x180
[ 193.943882] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 193.943895] RIP: 0033:0x7fd1ab8e514d
[ 193.943934] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10
c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00
00 00
[ 193.943957] RSP: 002b:00007ffe6261f400 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 193.943973] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd1ab8e514d
[ 193.943985] RDX: 00007ffe6261f4a0 RSI: 00000000c0286448 RDI: 0000000000000004
[ 193.943996] RBP: 00007ffe6261f450 R08: 0000000110000000 R09: 000000000000000e
[ 193.944008] R10: 0000000000000003 R11: 0000000000000246 R12: 00007ffe6261f4a0
[ 193.944019] R13: 00000000c0286448 R14: 0000000000000004 R15: 0000000021e92a70
Alex
>
> Alex
>
> > -
> > - if (page) {
> > - if (val & BIT_ULL(i * num_ring + 1))
> > - page->sched.ready = true;
> > - else
> > - page->sched.ready = false;
> > - }
> > - }
> > - /* publish sched.ready flag update effective immediately across smp */
> > - smp_rmb();
> > + /* Just return success here. We can't disable any rings otherwise
> > + * we race with vm udpates or buffer ops.
> > + */
> > return 0;
> > }
> >
> > --
> > 2.50.0
> >
More information about the amd-gfx
mailing list