[PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)
Michel Dänzer
michel at daenzer.net
Thu Oct 20 14:49:12 UTC 2022
On 2022-10-18 11:08, jiadong.zhu at amd.com wrote:
> From: "Jiadong.Zhu" <Jiadong.Zhu at amd.com>
>
> The software ring is created to support priority context while there is only
> one hardware queue for gfx.
>
> Every software ring has its fence driver and could be used as an ordinary ring
> for the GPU scheduler.
> Multiple software rings are bound to a real ring with the ring muxer. The
> packages committed on the software ring are copied to the real ring.
>
> v2: Use array to store software ring entry.
> v3: Remove unnecessary prints.
> v4: Remove amdgpu_ring_sw_init/fini functions,
> using gtt for sw ring buffer for later dma copy
> optimization.
> v5: Allocate ring entry dynamically in the muxer.
> v6: Update comments for the ring muxer.
> v7: Modify for function naming.
> v8: Combine software ring functions into amdgpu_ring_mux.c
I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.
amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:
spin_unlock(&mux->lock);
though I'm not sure why that would crash.
Are you not able to reproduce this with a GNOME Wayland session?
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 281 Comm: gfx_high Tainted: G E 6.0.2+ #1
Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
FS: 0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
Call Trace:
<TASK>
amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
amdgpu_job_run+0x102/0x1c0 [amdgpu]
drm_sched_main+0x19a/0x440 [gpu_sched]
? dequeue_task_stop+0x70/0x70
? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
kthread+0xe9/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
[...]
note: gfx_high[281] exited with preempt_count 1
[...]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
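
One possible reading of the trace, offered as an assumption rather than a diagnosis: "supervisor instruction fetch" together with RIP: 0010:0x0 is what an indirect call through a NULL function pointer usually looks like. Call trace entries are return addresses, so faddr2line would resolve to the statement after the call (here the spin_unlock), not to the call itself. The "preempt_count 1" note would also fit a crash while the spinlock was held. The sketch below only illustrates that general failure pattern and a defensive check. All names in it (struct ring_funcs, preempt_ib, ring_try_preempt, count_preempt) are hypothetical stand-ins, not the actual amdgpu code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the amdgpu structures. */
struct ring;

struct ring_funcs {
	/* Optional hook: a backend that lacks preemption leaves it NULL. */
	void (*preempt_ib)(struct ring *ring);
};

struct ring {
	const struct ring_funcs *funcs;
	int preempt_requests;
};

/* Example hook that just counts how often it was invoked. */
static void count_preempt(struct ring *ring)
{
	ring->preempt_requests++;
}

/*
 * Call the optional hook only if it is set.
 * Returns 0 if the hook ran, -1 if the backend does not provide one.
 * Calling ring->funcs->preempt_ib(ring) unconditionally would jump to
 * address 0 when the hook is NULL, i.e. an instruction fetch fault.
 */
static int ring_try_preempt(struct ring *ring)
{
	assert(ring != NULL);

	if (!ring->funcs || !ring->funcs->preempt_ib)
		return -1;

	ring->funcs->preempt_ib(ring);
	return 0;
}
```

With a table whose hook is NULL, ring_try_preempt() returns -1 instead of faulting; with count_preempt installed it returns 0 and increments the counter. Whether a missing hook is really what happens here would need to be checked against the ring funcs of the affected ASIC.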
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer