[PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

Thu Oct 20 14:59:01 UTC 2022

On 20.10.22 at 16:49, Michel Dänzer wrote:
> On 2022-10-18 11:08, jiadong.zhu at amd.com wrote:
>> From: "Jiadong.Zhu" <Jiadong.Zhu at amd.com>
>>
>> The software ring is created to support priority context while there is only
>> one hardware queue for gfx.
>>
>> Every software ring has its fence driver and could be used as an ordinary ring
>> for the GPU scheduler.
>> Multiple software rings are bound to a real ring with the ring muxer. The
>> packages committed on the software ring are copied to the real ring.
>>
>> v2: Use array to store software ring entry.
>> v3: Remove unnecessary prints.
>> v4: Remove amdgpu_ring_sw_init/fini functions,
>> using gtt for sw ring buffer for later dma copy
>> optimization.
>> v5: Allocate ring entry dynamically in the muxer.
>> v6: Update comments for the ring muxer.
>> v7: Modify for function naming.
>> v8: Combine software ring functions into amdgpu_ring_mux.c
> I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.

As long as you don't try to reset the GPU you can also test patch 5.
It's just that we can't upstream it in this form, because that would
break immediately.

Regards,
Christian.

>
> amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:
>
> 	spin_unlock(&mux->lock);
>
> though I'm not sure why that would crash.
>
>
> Are you not able to reproduce this with a GNOME Wayland session?
>
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 0 P4D 0
> Oops: 0010 [#1] PREEMPT SMP NOPTI
> CPU: 7 PID: 281 Comm: gfx_high Tainted: G            E      6.0.2+ #1
> Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
> RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
> RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
> R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
> FS:  0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
> Call Trace:
>   <TASK>
>   amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
>   amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
>   amdgpu_job_run+0x102/0x1c0 [amdgpu]
>   drm_sched_main+0x19a/0x440 [gpu_sched]
>   ? dequeue_task_stop+0x70/0x70
>   ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
>   kthread+0xe9/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x22/0x30
>   </TASK>
> [...]
> note: gfx_high[281] exited with preempt_count 1
> [...]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
> amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>
>
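
For readers following the thread, the quoted commit message says that several software rings are bound to one real ring through a muxer, and that packets committed on a software ring are copied to the real ring. The following is only a minimal user-space sketch of that idea. It is not the amdgpu implementation: `struct ring`, `struct sw_ring_entry`, `struct ring_mux`, `ring_write` and `mux_commit` are hypothetical names invented for illustration, and locking, fences, preemption and ring-full handling are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DWORDS 16u /* power of two, so masking wraps the index */

/* Hypothetical ring: a dword buffer plus a monotonically increasing write pointer. */
struct ring {
	uint32_t buf[RING_DWORDS];
	uint32_t wptr;
};

/* Hypothetical per-software-ring bookkeeping held by the muxer. */
struct sw_ring_entry {
	struct ring *sw;
	uint32_t copied; /* sw->wptr value up to which dwords were already copied */
};

/* Hypothetical muxer: one real ring, a fixed array of software rings. */
struct ring_mux {
	struct ring *real;
	struct sw_ring_entry entries[2];
	size_t count;
};

/* Append one dword to a ring. This sketch does not detect overflow. */
static void ring_write(struct ring *r, uint32_t v)
{
	r->buf[r->wptr & (RING_DWORDS - 1)] = v;
	r->wptr++;
}

/*
 * Copy everything written to software ring `idx` since the previous
 * commit onto the real ring. Returns the number of dwords copied.
 */
static uint32_t mux_commit(struct ring_mux *mux, size_t idx)
{
	struct sw_ring_entry *e;
	uint32_t n = 0;

	assert(idx < mux->count);
	e = &mux->entries[idx];

	while (e->copied != e->sw->wptr) {
		ring_write(mux->real, e->sw->buf[e->copied & (RING_DWORDS - 1)]);
		e->copied++;
		n++;
	}
	return n;
}
```

In this sketch the order of packets on the real ring is simply the order in which the software rings are committed, so a caller that commits the high-priority ring first gets its packets ahead of the low-priority ones. A real driver would additionally need a lock around the copy, which is where the `spin_unlock(&mux->lock)` mentioned in the quoted report comes in.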