6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

Mikhail Gavrilov mikhail.v.gavrilov at gmail.com
Mon Nov 6 13:54:34 UTC 2023


Hi,
another release cycle, and another regression.
After the latest kernel update in Fedora Rawhide, the GPU no longer
enters graphics mode on my ASUS ROG Strix G15 Advantage Edition
G513QY-HQ007 laptop, and this bug trace appears in the kernel log:
[   22.574698] ==================================================================
[   22.574704] BUG: KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575115] Read of size 4 at addr 0000000000000180 by task (udev-worker)/504

[   22.575125] CPU: 2 PID: 504 Comm: (udev-worker) Tainted: G        W    L     6.6.0-last-d2f51b3516dade79269ff45eae2a7668ae711b25+ #163
[   22.575135] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.331 02/24/2023
[   22.575143] Call Trace:
[   22.575147]  <TASK>
[   22.575151]  dump_stack_lvl+0x76/0xd0
[   22.575158]  kasan_report+0xa6/0xe0
[   22.575165]  ? amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575320]  kasan_check_range+0x105/0x1b0
[   22.575320]  amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575320]  gmc_v9_0_late_init+0xcf/0x1b0 [amdgpu]
[   22.575320]  amdgpu_device_ip_late_init+0x103/0x7b0 [amdgpu]
[   22.575320]  amdgpu_device_init+0x7b33/0x8a90 [amdgpu]
[   22.575320]  ? __pfx_amdgpu_device_init+0x10/0x10 [amdgpu]
[   22.575320]  ? __pfx_pci_bus_read_config_word+0x10/0x10
[   22.575320]  ? do_pci_enable_device+0x22d/0x2a0
[   22.575320]  ? __pfx_pci_request_acs+0x1/0x10
[   22.575320]  ? _raw_spin_unlock_irqrestore+0x66/0x80
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? __kasan_check_byte+0x13/0x50
[   22.575320]  amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[   22.575320]  amdgpu_pci_probe+0x282/0xac0 [amdgpu]
[   22.575320]  ? __pfx_amdgpu_pci_probe+0x10/0x10 [amdgpu]
[   22.575320]  local_pci_probe+0xdd/0x190
[   22.575320]  pci_device_probe+0x23a/0x780
[   22.575320]  ? kernfs_add_one+0x326/0x490
[   22.575320]  ? kernfs_get.part.0+0x4c/0x70
[   22.575320]  ? __pfx_pci_device_probe+0x10/0x10
[   22.575320]  ? kernfs_create_link+0x16b/0x230
[   22.575320]  ? kernfs_put+0x1c/0x40
[   22.575320]  ? sysfs_do_create_link_sd+0x8e/0x100
[   22.575320]  really_probe+0x3e2/0xb80
[   22.575320]  __driver_probe_device+0x18c/0x450
[   22.575320]  driver_probe_device+0x4a/0x120
[   22.575320]  __driver_attach+0x1e5/0x4a0
[   22.575320]  ? __pfx___driver_attach+0x10/0x10
[   22.575320]  bus_for_each_dev+0x109/0x190
[   22.575320]  ? __pfx_bus_for_each_dev+0x10/0x10
[   22.575320]  bus_add_driver+0x2a1/0x570
[   22.575320]  driver_register+0x134/0x460
[   22.575320]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[   22.575320]  do_one_initcall+0xd6/0x430
[   22.575320]  ? __pfx_do_one_initcall+0x10/0x10
[   22.575320]  ? kasan_unpoison+0x44/0x70
[   22.575320]  do_init_module+0x238/0x770
[   22.575320]  load_module+0x5581/0x6f10
[   22.575320]  ? __pfx_load_module+0x10/0x10
[   22.575320]  ? ima_post_read_file+0x189/0x1b0
[   22.575320]  ? __pfx_ima_post_read_file+0x10/0x10
[   22.575320]  ? __pfx_bpf_lsm_kernel_post_read_file+0x10/0x10
[   22.575320]  ? kernel_read_file+0x243/0x820
[   22.575320]  ? __pfx_kernel_read_file+0x10/0x10
[   22.575320]  ? init_module_from_file+0xd1/0x130
[   22.575320]  init_module_from_file+0xd1/0x130
[   22.575320]  ? __pfx_init_module_from_file+0x10/0x10
[   22.575320]  ? local_clock_noinstr+0x45/0xc0
[   22.575320]  ? do_raw_spin_unlock+0x58/0x1f0
[   22.575320]  idempotent_init_module+0x235/0x650
[   22.575320]  ? __pfx_idempotent_init_module+0x10/0x10
[   22.575320]  ? __pfx_bpf_lsm_capable+0x10/0x10
[   22.575320]  ? security_capable+0x74/0xb0
[   22.575320]  __x64_sys_finit_module+0xbe/0x130
[   22.575320]  do_syscall_64+0x64/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[   22.575320] RIP: 0033:0x7f8ab56bbf8d
[   22.575320] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 4e 0c 00 f7 d8 64 89 01 48
[   22.575320] RSP: 002b:00007ffe2e836608 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   22.575320] RAX: ffffffffffffffda RBX: 000055f55ef37f30 RCX: 00007f8ab56bbf8d
[   22.575320] RDX: 0000000000000000 RSI: 000055f55ef10950 RDI: 0000000000000015
[   22.575320] RBP: 00007ffe2e8366c0 R08: 0000000000000000 R09: 00007ffe2e836650
[   22.575320] R10: 0000000000000015 R11: 0000000000000246 R12: 000055f55ef10950
[   22.575320] R13: 0000000000020000 R14: 000055f55ef37240 R15: 000055f55ef393d0
[   22.575320]  </TASK>
[   22.575320] ==================================================================


Using git bisect, I found that this commit is to blame:
❯ git bisect good
73582be11ac8f6d6765e185bf48f22efb9d28c3b is the first bad commit
commit 73582be11ac8f6d6765e185bf48f22efb9d28c3b
Author: Tao Zhou <tao.zhou1 at amd.com>
Date:   Thu Oct 12 14:33:37 2023 +0800

    drm/amdgpu: bypass RAS error reset in some conditions

    PMFW is responsible for RAS error reset in some conditions, driver can
    skip the operation.

    v2: add check for ras->in_recovery, it's set earlier than
    amdgpu_in_reset.

    v3: fix error in gpu reset check.

    Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
    Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

I rebuilt the kernel from master with commit
73582be11ac8f6d6765e185bf48f22efb9d28c3b reverted, and my laptop started
working again.

I have attached all kernel logs and the build config below.
Laptop hardware probe is here: https://linux-hardware.org/?probe=85a38e7906

-- 
Best Regards,
Mike Gavrilov.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bisect-all-steps-dmesg.zip
Type: application/zip
Size: 522595 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231106/6e7859db/attachment-0003.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg-6.6.0-last-d2f51b3516dade79269ff45eae2a7668ae711b25.zip
Type: application/zip
Size: 43517 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231106/6e7859db/attachment-0004.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.zip
Type: application/zip
Size: 65199 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231106/6e7859db/attachment-0005.zip>
