[PATCH] drm/sched: Check scheduler ready before calling timeout handling

Christian König ckoenig.leichtzumerken at gmail.com
Wed Apr 12 09:35:48 UTC 2023


Am 12.04.23 um 02:00 schrieb Luben Tuikov:
> On 2023-04-11 17:39, Alex Deucher wrote:
>> On Thu, Apr 6, 2023 at 4:01 PM Luben Tuikov <luben.tuikov at amd.com> wrote:
>>> From: Vitaly Prosyak <vitaly.prosyak at amd.com>
>>>
>>> During an IGT GPU reset test we see the following oops,
>>>
>>> [  +0.000003] ------------[ cut here ]------------
>>> [  +0.000000] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:1656 __queue_delayed_work+0x6d/0xa0
>>> [  +0.000004] Modules linked in: iptable_filter bpfilter amdgpu(OE) nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr ledtrig_audio snd_hda_codec_hdmi intel_rapl_common snd_hda_intel edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core iommu_v2 gpu_sched(OE) kvm_amd drm_buddy snd_hwdep kvm video drm_ttm_helper snd_pcm ttm snd_seq_midi drm_display_helper snd_seq_midi_event snd_rawmidi cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_seq aesni_intel rc_core crypto_simd cryptd binfmt_misc drm_kms_helper rapl snd_seq_device input_leds joydev snd_timer i2c_algo_bit syscopyarea snd ccp sysfillrect sysimgblt wmi_bmof k10temp soundcore mac_hid sch_fq_codel msr parport_pc ppdev drm lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid r8169 ahci xhci_pci gpio_amdpt realtek i2c_piix4 wmi crc32_pclmul xhci_pci_renesas libahci gpio_generic
>>> [  +0.000070] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W OE      6.1.11+ #2
>>> [  +0.000003] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
>>> [  +0.000001] RIP: 0010:__queue_delayed_work+0x6d/0xa0
>>> [  +0.000003] Code: 7a 50 48 01 c1 48 89 4a 30 81 ff 00 20 00 00 75 38 4c 89 cf e8 64 3e 0a 00 5d e9 1e c5 11 01 e8 99 f7 ff ff 5d e9 13 c5 11 01 <0f> 0b eb c1 0f 0b 48 81 7a 38 70 5c 0e 81 74 9f 0f 0b 48 8b 42 28
>>> [  +0.000002] RSP: 0018:ffffc90000398d60 EFLAGS: 00010007
>>> [  +0.000002] RAX: ffff88810d589c60 RBX: 0000000000000000 RCX: 0000000000000000
>>> [  +0.000002] RDX: ffff88810d589c58 RSI: 0000000000000000 RDI: 0000000000002000
>>> [  +0.000001] RBP: ffffc90000398d60 R08: 0000000000000000 R09: ffff88810d589c78
>>> [  +0.000002] R10: 72705f305f39765f R11: 7866673a6d72645b R12: ffff88810d589c58
>>> [  +0.000001] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000000000
>>> [  +0.000002] FS:  0000000000000000(0000) GS:ffff8887fee40000(0000) knlGS:0000000000000000
>>> [  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.000002] CR2: 00005562c4797fa0 CR3: 0000000110da0000 CR4: 00000000003506e0
>>> [  +0.000002] Call Trace:
>>> [  +0.000001]  <IRQ>
>>> [  +0.000001]  mod_delayed_work_on+0x5e/0xa0
>>> [  +0.000004]  drm_sched_fault+0x23/0x30 [gpu_sched]
>>> [  +0.000007]  gfx_v9_0_fault.isra.0+0xa6/0xd0 [amdgpu]
>>> [  +0.000258]  gfx_v9_0_priv_reg_irq+0x29/0x40 [amdgpu]
>>> [  +0.000254]  amdgpu_irq_dispatch+0x1ac/0x2b0 [amdgpu]
>>> [  +0.000243]  amdgpu_ih_process+0x89/0x130 [amdgpu]
>>> [  +0.000245]  amdgpu_irq_handler+0x24/0x60 [amdgpu]
>>> [  +0.000165]  __handle_irq_event_percpu+0x4f/0x1a0
>>> [  +0.000003]  handle_irq_event_percpu+0x15/0x50
>>> [  +0.000001]  handle_irq_event+0x39/0x60
>>> [  +0.000002]  handle_edge_irq+0xa8/0x250
>>> [  +0.000003]  __common_interrupt+0x7b/0x150
>>> [  +0.000002]  common_interrupt+0xc1/0xe0
>>> [  +0.000003]  </IRQ>
>>> [  +0.000000]  <TASK>
>>> [  +0.000001]  asm_common_interrupt+0x27/0x40
>>> [  +0.000002] RIP: 0010:native_safe_halt+0xb/0x10
>>> [  +0.000003] Code: 46 ff ff ff cc cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 f2 5e 00 f4 e9 f1 3b 3e 00 90 eb 07 0f 00 2d 59 f2 5e 00 fb f4 <e9> e0 3b 3e 00 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 d4 fe ff 66 90
>>> [  +0.000002] RSP: 0018:ffffc9000018fdc8 EFLAGS: 00000246
>>> [  +0.000002] RAX: 0000000000004000 RBX: 000000000002e5a8 RCX: 000000000000001f
>>> [  +0.000001] RDX: 0000000000000001 RSI: ffff888101298800 RDI: ffff888101298864
>>> [  +0.000001] RBP: ffffc9000018fdd0 R08: 000000527f64bd8b R09: 000000000001dc90
>>> [  +0.000001] R10: 000000000001dc90 R11: 0000000000000003 R12: 0000000000000001
>>> [  +0.000001] R13: ffff888101298864 R14: ffffffff832d9e20 R15: ffff888193aa8c00
>>> [  +0.000003]  ? acpi_idle_do_entry+0x5e/0x70
>>> [  +0.000002]  acpi_idle_enter+0xd1/0x160
>>> [  +0.000003]  cpuidle_enter_state+0x9a/0x6e0
>>> [  +0.000003]  cpuidle_enter+0x2e/0x50
>>> [  +0.000003]  call_cpuidle+0x23/0x50
>>> [  +0.000002]  do_idle+0x1de/0x260
>>> [  +0.000002]  cpu_startup_entry+0x20/0x30
>>> [  +0.000002]  start_secondary+0x120/0x150
>>> [  +0.000003]  secondary_startup_64_no_verify+0xe5/0xeb
>>> [  +0.000004]  </TASK>
>>> [  +0.000000] ---[ end trace 0000000000000000 ]---
>>> [  +0.000003] BUG: kernel NULL pointer dereference, address: 0000000000000102
>>> [  +0.006233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=3, emitted seq=4
>>> [  +0.000734] #PF: supervisor read access in kernel mode
>>> [  +0.009670] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amd_deadlock pid 2002 thread amd_deadlock pid 2002
>>> [  +0.005135] #PF: error_code(0x0000) - not-present page
>>> [  +0.000002] PGD 0 P4D 0
>>> [  +0.000002] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>> [  +0.000002] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W OE      6.1.11+ #2
>>> [  +0.000002] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
>>> [  +0.012101] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
>>> [  +0.005136] RIP: 0010:__queue_work+0x1f/0x4e0
>>> [  +0.000004] Code: 87 cd 11 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 83 ec 10 89 7d d4 <f6> 86 02 01 00 00 01 0f 85 6c 03 00 00 e8 7f 36 08 00 8b 45 d4 48
>>>
>>> For gfx_rings the schedulers may not be initialized by
>>> amdgpu_device_init_schedulers() due to ring->no_scheduler flag being set to
>>> true and thus the timeout_wq is NULL. As a result, since all ASICs call
>>> drm_sched_fault() unconditionally even for schedulers which have not been
>>> initialized, it is simpler to use the ready condition which indicates whether
>>> the given scheduler worker thread runs and whether the timeout_wq of the reset
>>> domain has been initialized.
>>>
>>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>
>>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>> Please make sure this lands in drm-misc.  Generally all scheduler
>> patches should go through drm-misc unless there is a particular driver
>> dependency.
> drm-misc-next, drm-misc-fixes or both?

I think drm-misc-fixes should be appropriate.

Christian.


More information about the amd-gfx mailing list