[PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

Mon Oct 31 08:10:30 UTC 2022

Hi Michel,

Sorry for the late response. The NULL pointer dereference is most likely raised from amdgpu_ring_preempt_ib, because the preempt_ib callback is not assigned.

Could you check that preempt_ib is set in gfx_v9_0.c by the patch "drm/amdgpu: Modify unmap_queue format for gfx9 (v4)"?

"
...
.preempt_ib = gfx_v9_0_ring_preempt_ib,
"
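
To illustrate the suspected failure mode, here is a minimal user-space sketch. It is not the amdgpu code: every demo_* name is hypothetical, and the NULL check is only an assumption about how a caller could guard itself. Calling through a callback slot that was never assigned jumps to address 0, which would match the "RIP: 0010:0x0" instruction fetch in the oops.

```c
#include <errno.h>
#include <stddef.h>

struct demo_ring;

/* Hypothetical callback table, standing in for a per-ring funcs struct. */
struct demo_ring_funcs {
	int (*preempt_ib)(struct demo_ring *ring);
};

struct demo_ring {
	const struct demo_ring_funcs *funcs;
	int preempted;	/* counts successful preempt requests */
};

/* Hypothetical callback implementation. */
static int demo_preempt_ib(struct demo_ring *ring)
{
	ring->preempted++;
	return 0;
}

/* Table with the callback assigned, as in the snippet quoted above. */
static const struct demo_ring_funcs demo_funcs_with_preempt = {
	.preempt_ib = demo_preempt_ib,
};

/* Table where the callback was never assigned, so the slot is NULL. */
static const struct demo_ring_funcs demo_funcs_without_preempt = {
	.preempt_ib = NULL,
};

/*
 * Guarded call (an assumption for illustration): report "not supported"
 * instead of jumping through a NULL function pointer.
 */
static int demo_ring_try_preempt(struct demo_ring *ring)
{
	if (!ring || !ring->funcs || !ring->funcs->preempt_ib)
		return -EOPNOTSUPP;
	return ring->funcs->preempt_ib(ring);
}
```

With the first table the call reaches demo_preempt_ib; with the second, the guard returns -EOPNOTSUPP, where an unguarded call would fault.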

Btw, which kmd branch are you cherry-picking onto? Maybe my code base is too old.
I tried Wayland on Ubuntu 20.04 and could not reproduce the crash.

Thanks,
Jiadong

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Thursday, October 20, 2022 10:59 PM
To: Michel Dänzer <michel at daenzer.net>; Zhu, Jiadong <Jiadong.Zhu at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Tuikov, Luben <Luben.Tuikov at amd.com>
Subject: Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

On 20.10.22 at 16:49, Michel Dänzer wrote:
> On 2022-10-18 11:08, jiadong.zhu at amd.com wrote:
>> From: "Jiadong.Zhu" <Jiadong.Zhu at amd.com>
>>
>> The software ring is created to support priority context while there
>> is only one hardware queue for gfx.
>>
>> Every software ring has its fence driver and could be used as an
>> ordinary ring for the GPU scheduler.
>> Multiple software rings are bound to a real ring with the ring muxer.
>> The packages committed on the software ring are copied to the real ring.
>>
>> v2: Use array to store software ring entry.
>> v3: Remove unnecessary prints.
>> v4: Remove amdgpu_ring_sw_init/fini functions, using gtt for sw ring
>> buffer for later dma copy optimization.
>> v5: Allocate ring entry dynamically in the muxer.
>> v6: Update comments for the ring muxer.
>> v7: Modify for function naming.
>> v8: Combine software ring functions into amdgpu_ring_mux.c
> I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.

As long as you don't try to reset the GPU you can also test patch 5.
It's just that we can't upstream it like this, or that would break immediately.

Regards,
Christian.

>
> amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:
>
>       spin_unlock(&mux->lock);
>
> though I'm not sure why that would crash.
>
>
> Are you not able to reproduce this with a GNOME Wayland session?
>
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 0 P4D 0
> Oops: 0010 [#1] PREEMPT SMP NOPTI
> CPU: 7 PID: 281 Comm: gfx_high Tainted: G            E      6.0.2+ #1
> Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
> RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
> RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
> R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
> FS:  0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
> Call Trace:
>   <TASK>
>   amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
>   amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
>   amdgpu_job_run+0x102/0x1c0 [amdgpu]
>   drm_sched_main+0x19a/0x440 [gpu_sched]
>   ? dequeue_task_stop+0x70/0x70
>   ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
>   kthread+0xe9/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x22/0x30
>   </TASK>
> [...]
> note: gfx_high[281] exited with preempt_count 1 [...]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
> amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>
>