[PATCH] drm/amdgpu: Register VGA clients after init can no longer fail

Alex Deucher alexdeucher at gmail.com
Fri Apr 23 18:59:54 UTC 2021


On Wed, Apr 21, 2021 at 1:43 PM Kai-Heng Feng
<kai.heng.feng at canonical.com> wrote:
>
> When an amdgpu device fails to init, it makes another VGA device cause
> kernel splat:
> kernel: amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed
> kernel: amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init
> kernel: amdgpu: probe of 0000:08:00.0 failed with error -110
> ...
> kernel: amdgpu 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
> kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018
> kernel: #PF: supervisor read access in kernel mode
> kernel: #PF: error_code(0x0000) - not-present page
> kernel: PGD 0 P4D 0
> kernel: Oops: 0000 [#1] SMP NOPTI
> kernel: CPU: 6 PID: 1080 Comm: Xorg Tainted: G        W         5.12.0-rc8+ #12
> kernel: Hardware name: HP HP EliteDesk 805 G6/872B, BIOS S09 Ver. 02.02.00 12/30/2020
> kernel: RIP: 0010:amdgpu_device_vga_set_decode+0x13/0x30 [amdgpu]
> kernel: Code: 06 31 c0 c3 b8 ea ff ff ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 8b 87 90 06 00 00 48 89 e5 53 89 f3 <48> 8b 40 18 40 0f b6 f6 e8 40 58 39 fd 80 fb 01 5b 5d 19 c0 83 e0
> kernel: RSP: 0018:ffffae3c0246bd68 EFLAGS: 00010002
> kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> kernel: RDX: ffff8dd1af5a8560 RSI: 0000000000000000 RDI: ffff8dce8c160000
> kernel: RBP: ffffae3c0246bd70 R08: ffff8dd1af5985c0 R09: ffffae3c0246ba38
> kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000246
> kernel: R13: 0000000000000000 R14: 0000000000000003 R15: ffff8dce81490000
> kernel: FS:  00007f9303d8fa40(0000) GS:ffff8dd1af580000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 0000000000000018 CR3: 0000000103cfa000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel:  vga_arbiter_notify_clients.part.0+0x4a/0x80
> kernel:  vga_get+0x17f/0x1c0
> kernel:  vga_arb_write+0x121/0x6a0
> kernel:  ? apparmor_file_permission+0x1c/0x20
> kernel:  ? security_file_permission+0x30/0x180
> kernel:  vfs_write+0xca/0x280
> kernel:  ksys_write+0x67/0xe0
> kernel:  __x64_sys_write+0x1a/0x20
> kernel:  do_syscall_64+0x38/0x90
> kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> kernel: RIP: 0033:0x7f93041e02f7
> kernel: Code: 75 05 48 83 c4 58 c3 e8 f7 33 ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> kernel: RSP: 002b:00007fff60e49b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f93041e02f7
> kernel: RDX: 000000000000000b RSI: 00007fff60e49b40 RDI: 000000000000000f
> kernel: RBP: 00007fff60e49b40 R08: 00000000ffffffff R09: 00007fff60e499d0
> kernel: R10: 00007f93049350b5 R11: 0000000000000246 R12: 000056111d45e808
> kernel: R13: 0000000000000000 R14: 000056111d45e7f8 R15: 000056111d46c980
> kernel: Modules linked in: nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq input_leds snd_seq_device snd_timer snd soundcore joydev kvm_amd serio_raw k10temp mac_hid hp_wmi ccp kvm sparse_keymap wmi_bmof ucsi_acpi efi_pstore typec_ucsi rapl typec video wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel cec rc_core aesni_intel crypto_simd psmouse cryptd r8169 i2c_piix4 drm ahci xhci_pci realtek libahci xhci_pci_renesas gpio_amdpt gpio_generic
> kernel: CR2: 0000000000000018
> kernel: ---[ end trace 76d04313d4214c51 ]---
>
> Commit 4192f7b57689 ("drm/amdgpu: unmap register bar on device init
> failure") makes amdgpu_driver_unload_kms() skips amdgpu_device_fini(),
> so the VGA clients remain registered. So when
> vga_arbiter_notify_clients() iterates over registered clients, it causes
> NULL pointer dereference.
>
> Since there's no reason to register VGA clients that early, so solve
> the issue by putting them after all the goto cleanups.
>
> Fixes: 4192f7b57689 ("drm/amdgpu: unmap register bar on device init failure")
> Signed-off-by: Kai-Heng Feng <kai.heng.feng at canonical.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 +++++++++++-----------
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b4ad1c055c70..115a7699e11e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3410,19 +3410,6 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         /* doorbell bar mapping and doorbell index init*/
>         amdgpu_device_doorbell_init(adev);
>
> -       /* if we have > 1 VGA cards, then disable the amdgpu VGA resources */
> -       /* this will fail for cards that aren't VGA class devices, just
> -        * ignore it */
> -       if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
> -               vga_client_register(adev->pdev, adev, NULL, amdgpu_device_vga_set_decode);
> -
> -       if (amdgpu_device_supports_px(ddev)) {
> -               px = true;
> -               vga_switcheroo_register_client(adev->pdev,
> -                                              &amdgpu_switcheroo_ops, px);
> -               vga_switcheroo_init_domain_pm_ops(adev->dev, &adev->vga_pm_domain);
> -       }
> -
>         if (amdgpu_emu_mode == 1) {
>                 /* post the asic on emulation mode */
>                 emu_soc_asic_init(adev);
> @@ -3619,6 +3606,19 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         if (amdgpu_device_cache_pci_state(adev->pdev))
>                 pci_restore_state(pdev);
>
> +       /* if we have > 1 VGA cards, then disable the amdgpu VGA resources */
> +       /* this will fail for cards that aren't VGA class devices, just
> +        * ignore it */
> +       if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
> +               vga_client_register(adev->pdev, adev, NULL, amdgpu_device_vga_set_decode);
> +
> +       if (amdgpu_device_supports_px(ddev)) {
> +               px = true;
> +               vga_switcheroo_register_client(adev->pdev,
> +                                              &amdgpu_switcheroo_ops, px);
> +               vga_switcheroo_init_domain_pm_ops(adev->dev, &adev->vga_pm_domain);
> +       }
> +

If we move vga_switcheroo_init_domain_pm_ops() here, I think we can
remove the vga_switcheroo cleanup in the failed: label.

Alex


>         if (adev->gmc.xgmi.pending_reset)
>                 queue_delayed_work(system_wq, &mgpu_info.delayed_reset_work,
>                                    msecs_to_jiffies(AMDGPU_RESUME_MS));
> --
> 2.30.2
>
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


More information about the dri-devel mailing list