Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

Tue Jul 19 08:40:46 UTC 2022

I was told that this patch replaces the patch you mentioned
https://patchwork.freedesktop.org/series/106078/ and it is the one
that'll hopefully land in Linus's tree.

On Tue, 19 Jul 2022 at 03:33, Chen, Guchun <Guchun.Chen at amd.com> wrote:
>
> Patch https://patchwork.freedesktop.org/series/106024/ should fix this.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Mikhail Gavrilov
> Sent: Tuesday, July 19, 2022 7:50 AM
> To: amd-gfx list <amd-gfx at lists.freedesktop.org>; Linux List Kernel Mailing <linux-kernel at vger.kernel.org>; Christian König <ckoenig.leichtzumerken at gmail.com>
> Subject: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
>
> Hi guys, I am continuing to test 5.19-rc7 and found a bug.
> Command "clinfo" causes BUG: kernel NULL pointer dereference, address:
> 0000000000000008 on driver amdgpu.
>
> Here is trace:
> [ 1320.203332] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [ 1320.203338] #PF: supervisor read access in kernel mode
> [ 1320.203340] #PF: error_code(0x0000) - not-present page
> [ 1320.203341] PGD 0 P4D 0
> [ 1320.203344] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 1320.203346] CPU: 5 PID: 1226 Comm: kworker/5:2 Tainted: G W L -------- --- 5.19.0-0.rc7.53.fc37.x86_64+debug #1
> [ 1320.203348] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
> [ 1320.203350] Workqueue: events delayed_fput
> [ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0
> [ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53
> [ 1320.203360] RSP: 0018:ffffaf4cc1adfc68 EFLAGS: 00010246
> [ 1320.203362] RAX: ffff976660408208 RBX: ffff975f545f2000 RCX: 0000000000000000
> [ 1320.203363] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff976660408198
> [ 1320.203364] RBP: ffff976806f6e800 R08: 0000000000000000 R09: 0000000000000000
> [ 1320.203366] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> [ 1320.203367] R13: ffff976660408198 R14: ffff975f545f2000 R15: ffff976660408198
> [ 1320.203368] FS: 0000000000000000(0000) GS:ffff976de1200000(0000) knlGS:0000000000000000
> [ 1320.203370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1320.203371] CR2: 0000000000000008 CR3: 00000007fb31c000 CR4: 0000000000350ee0
> [ 1320.203372] Call Trace:
> [ 1320.203374] <TASK>
> [ 1320.203378] amdgpu_amdkfd_gpuvm_destroy_cb+0x5d/0x1e0 [amdgpu]
> [ 1320.203516] amdgpu_vm_fini+0x2f/0x4e0 [amdgpu]
> [ 1320.203625] ? mutex_destroy+0x21/0x50
> [ 1320.203629] amdgpu_driver_postclose_kms+0x1da/0x2b0 [amdgpu]
> [ 1320.203734] drm_file_free.part.0+0x20d/0x260
> [ 1320.203738] drm_release+0x6a/0x120
> [ 1320.203741] __fput+0xab/0x270
> [ 1320.203743] delayed_fput+0x1f/0x30
> [ 1320.203745] process_one_work+0x2a0/0x600
> [ 1320.203749] worker_thread+0x4f/0x3a0
> [ 1320.203751] ? process_one_work+0x600/0x600
> [ 1320.203753] kthread+0xf5/0x120
> [ 1320.203755] ? kthread_complete_and_exit+0x20/0x20
> [ 1320.203758] ret_from_fork+0x22/0x30
> [ 1320.203764] </TASK>
>
> Full kernel log is here:
> https://pastebin.com/EeKh2LEr
>
> One hour later, after a lot of "BUG: workqueue lockup" messages, the GPU hung completely.
>
> I will be glad to test patches that fix this bug.
>
> --
> Best Regards,
> Mike Gavrilov.