[regression][6.1] After commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 system randomly hangs

Christian König ckoenig.leichtzumerken at gmail.com
Tue Oct 11 12:05:26 UTC 2022


Yeah, that's a known issue. Dave already reverted the problematic patch 
in drm-next.

But it will probably take a while until the revert propagates down into
drm-misc-next or whatever branch you use.

Regards,
Christian.

On 11.10.22 at 14:02, Mikhail Gavrilov wrote:
> Hi!
> The hangs occur randomly, but I found a good way to reproduce them
> (running the campaign in the game Halo Infinite).
> The backtrace looks like this:
>
> [  147.260971] BUG: kernel NULL pointer dereference, address: 0000000000000088
> [  147.260987] ------------[ cut here ]------------
> [  147.260988] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:321
> __local_bh_disable_ip+0x9e/0xb0
> [  147.260993] Modules linked in: uinput rfcomm snd_seq_dummy
> snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
> nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
> nf_tables nfnetlink qrtr bnep sunrpc snd_sof_amd_renoir intel_rapl_msr
> snd_sof_amd_acp intel_rapl_common mt7921e snd_sof_pci mt7921_common
> binfmt_misc snd_sof mt76_connac_lib snd_sof_utils vfat
> snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic mt76 fat
> snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_compress ac97_bus
> btusb kvm_amd snd_intel_dspcfg snd_pcm_dmaengine btrtl
> snd_intel_sdw_acpi btbcm snd_hda_codec snd_pci_acp6x mac80211 kvm
> snd_hda_core btintel btmtk irqbypass snd_hwdep snd_seq libarc4
> snd_seq_device bluetooth snd_pcm snd_pci_acp5x snd_timer
> snd_rn_pci_acp3x cfg80211 rapl pcspkr joydev asus_nb_wmi wmi_bmof
> snd_acp_config snd snd_soc_acpi k10temp
> [  147.261033]  soundcore i2c_piix4 snd_pci_acp3x asus_wireless
> amd_pmc zram amdgpu drm_ttm_helper ttm hid_asus iommu_v2 asus_wmi
> gpu_sched ledtrig_audio sparse_keymap drm_buddy platform_profile
> drm_display_helper crct10dif_pclmul crc32_pclmul nvme rfkill
> crc32c_intel ucsi_acpi hid_multitouch video ghash_clmulni_intel
> nvme_core ccp typec_ucsi serio_raw r8169 cec sp5100_tco typec
> i2c_hid_acpi wmi i2c_hid ip6_tables ip_tables fuse
> [  147.261045] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W    L
>     6.0.0-rc2-02-907cc346ff6a69a08b4786c4ed2a78ac0120b9da+ #124
> [  147.261046] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
> G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
> [  147.261047] RIP: 0010:__local_bh_disable_ip+0x9e/0xb0
> [  147.261048] Code: 25 00 1e 02 00 48 89 df e8 6f 23 08 00 85 c0 75
> 0e 48 89 9d 30 1c 00 00 5b 5d c3 cc cc cc cc 31 ff 31 db e8 54 23 08
> 00 eb e7 <0f> 0b e9 76 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> 44 00
> [  147.261049] RSP: 0018:ffffa4e1c028c8d8 EFLAGS: 00010006
> [  147.261050] RAX: 0000000080010005 RBX: 0000000000000201 RCX: 0000000000000018
> [  147.261051] RDX: 00000f440b255950 RSI: 0000000000000201 RDI: ffffffffc1b652e5
> [  147.261051] RBP: ffff93a4eaf00fd8 R08: 0000000000000001 R09: 0000000000000000
> [  147.261052] R10: 7635d840c31a8942 R11: fcca632b3d1b0d46 R12: ffff93a4f7831000
> [  147.261052] R13: ffff93a4eaf00ee0 R14: ffff93a4efd84178 R15: ffff93a4efd84000
> [  147.261053] FS:  0000000000000000(0000) GS:ffff93b396e00000(0000)
> knlGS:0000000000000000
> [  147.261054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  147.261055] CR2: 0000000000000088 CR3: 000000012a610000 CR4: 0000000000750ee0
> [  147.261056] PKRU: 55555554
> [  147.261056] Call Trace:
> [  147.261060]  <IRQ>
> [  147.261068]  _raw_spin_lock_bh+0x1d/0x80
> [  147.261074]  ieee80211_queue_skb+0x125/0x7a0 [mac80211]
> [  147.261113]  ? __skb_get_hash+0x55/0x200
> [  147.261117]  ieee80211_tx_8023+0x9c/0x1c0 [mac80211]
> [  147.261155]  ieee80211_subif_start_xmit_8023+0x2b5/0x510 [mac80211]
> [  147.261191]  netpoll_start_xmit+0x121/0x190
> [  147.261199]  netpoll_send_skb+0x1fc/0x300
> [  147.261202]  write_msg+0xdc/0xf0 [netconsole]
> [  147.261207]  console_emit_next_record.constprop.0+0x17d/0x300
> [  147.261214]  console_unlock+0xf3/0x1f0
> [  147.261215]  vprintk_emit+0x152/0x350
> [  147.261217]  ? plist_add+0xba/0xf0
> [  147.261223]  _printk+0x48/0x4e
> [  147.261231]  ? rcu_read_lock_sched_held+0x10/0x80
> [  147.261235]  page_fault_oops.cold+0xcf/0x1f9
> [  147.261240]  ? do_user_addr_fault+0x65/0x6b0
> [  147.261243]  ? _raw_spin_unlock_irqrestore+0x40/0x60
> [  147.261247]  exc_page_fault+0x7e/0x300
> [  147.261249]  asm_exc_page_fault+0x22/0x30
> [  147.261252] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x1e0 [gpu_sched]
> [  147.261255] Code: 89 d7 e8 87 02 0d f0 e9 54 ff ff ff 48 89 d7 e8
> ea 66 37 f0 e9 47 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 53
> 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d 70 02 00 00 48 8b 85 a8 03 00
> 00 f0
> [  147.261256] RSP: 0018:ffffa4e1c028cdc8 EFLAGS: 00010093
> [  147.261257] RAX: ffffffffc06dc380 RBX: 0000000000000000 RCX: 0000000000000018
> [  147.261257] RDX: 00000efa9afe3594 RSI: ffff93a7a4c1ec90 RDI: 0000000000000000
> [  147.261258] RBP: ffff93a7a4c1ee10 R08: 0000000000000001 R09: 0000000000000000
> [  147.261259] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa4e1c028cde8
> [  147.261259] R13: 0000000000000086 R14: 00000000ffffffff R15: ffff93a4fbed0198
> [  147.261261]  ? drm_sched_job_done.isra.0+0x1e0/0x1e0 [gpu_sched]
> [  147.261266]  dma_fence_signal_timestamp_locked+0x9e/0x1c0
> [  147.261274]  dma_fence_signal+0x36/0x70
> [  147.261276]  amdgpu_fence_process+0xd5/0x140 [amdgpu]
> [  147.261497]  sdma_v5_2_process_trap_irq+0x10d/0x130 [amdgpu]
> [  147.261737]  amdgpu_irq_dispatch+0xcc/0x270 [amdgpu]
> [  147.261918]  amdgpu_ih_process+0x80/0x100 [amdgpu]
> [  147.262072]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> [  147.262227]  __handle_irq_event_percpu+0x93/0x330
> [  147.262231]  handle_irq_event+0x34/0x70
> [  147.262232]  handle_edge_irq+0x9f/0x240
> [  147.262233]  __common_interrupt+0x71/0x150
> [  147.262236]  common_interrupt+0xb4/0xd0
> [  147.262237]  </IRQ>
> [  147.262238]  <TASK>
>
> bisect points to this commit: e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
>
> $ git bisect bad
> e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 is the first bad commit
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav <Arvind.Yadav at amd.com>
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
>      drm/sched: Use parent fence instead of finished
>
>      Using the parent fence instead of the finished fence
>      to get the job status. This change is to avoid GPU
>      scheduler timeout error which can cause GPU reset.
>
>      Signed-off-by: Arvind Yadav <Arvind.Yadav at amd.com>
>      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>      Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
>      Signed-off-by: Christian König <christian.koenig at amd.com>
>
>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> Thanks.
>



More information about the amd-gfx mailing list