rcu_sched detected expedited stalls in amdgpu after suspend
Paul E. McKenney
paulmck at kernel.org
Mon Jun 27 20:41:39 UTC 2022
On Mon, Jun 27, 2022 at 03:22:24PM -0400, Alex Xu (Hello71) wrote:
> Hi,
>
> Since Linux 5.19-ish, I consistently get these types of errors when
> resuming from S3:
>
> [15652.909157] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9981 root: 0x800/.
> [15652.909162] rcu: blocking rcu_node structures (internal RCU debug):
> [15652.909163] Task dump for CPU 11:
> [15652.909164] task:kworker/u24:65 state:R running task stack: 0 pid:210218 ppid: 2 flags:0x00004008
> [15652.909167] Workqueue: events_unbound async_run_entry_fn
> [15652.909172] Call Trace:
> [15652.909173] <TASK>
> [15652.909174] ? atom_get_src_int+0x38e/0x680
> [15652.909179] ? atom_op_test+0x67/0x190
> [15652.909181] ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909184] ? atom_op_calltable+0xb1/0x110
> [15652.909186] ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909189] ? atom_op_calltable+0xb1/0x110
> [15652.909191] ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909193] ? __switch_to+0x137/0x440
> [15652.909195] ? amdgpu_atom_asic_init+0xe0/0x100
> [15652.909198] ? pci_bus_read_config_dword+0x36/0x50
> [15652.909201] ? amdgpu_device_resume+0x10b/0x3e0
> [15652.909203] ? amdgpu_pmops_resume+0x32/0x60
> [15652.909204] ? pci_pm_suspend+0x2b0/0x2b0
> [15652.909206] ? dpm_run_callback+0x35/0x1f0
> [15652.909209] ? device_resume+0x1ca/0x220
> [15652.909211] ? async_resume+0x19/0xe0
> [15652.909213] ? async_run_entry_fn+0x33/0x120
> [15652.909215] ? process_one_work+0x1d6/0x350
> [15652.909218] ? worker_thread+0x24d/0x480
> [15652.909220] ? kthread+0x137/0x150
> [15652.909221] ? worker_clr_flags+0x40/0x40
> [15652.909224] ? kthread_blkcg+0x30/0x30
> [15652.909226] ? ret_from_fork+0x22/0x30
> [15652.909227] </TASK>
> [15653.015808] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9985 root: 0x800/.
> [15653.015812] rcu: blocking rcu_node structures (internal RCU debug):
> [15653.015813] Task dump for CPU 11:
> [15653.015813] task:kworker/u24:65 state:R running task stack: 0 pid:210218 ppid: 2 flags:0x00004008
> [15653.015816] Workqueue: events_unbound async_run_entry_fn
> [15653.015820] Call Trace:
> [15653.015820] <TASK>
> [15653.015821] ? amdgpu_cgs_read_register+0x10/0x10
> [15653.015825] ? smu7_copy_bytes_to_smc+0xd4/0x200
> [15653.015828] ? polaris10_program_memory_timing_parameters+0x195/0x1b0
> [15653.015831] ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015834] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015836] ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015839] ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015841] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015843] ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015846] ? amdgpu_device_rreg+0x8f/0xd0
> [15653.015847] ? phm_wait_for_register_unequal+0x99/0xd0
> [15653.015850] ? smu7_send_msg_to_smc+0x95/0x130
> [15653.015853] ? smum_send_msg_to_smc+0x5d/0xa0
> [15653.015854] ? amdgpu_cgs_read_ind_register+0xa0/0xa0
> [15653.015857] ? smu7_enable_dpm_tasks+0x241f/0x28c0
> [15653.015859] ? hwmgr_resume+0x31/0x70
> [15653.015861] ? amdgpu_device_resume+0x1fa/0x3e0
> [15653.015863] ? amdgpu_pmops_resume+0x32/0x60
> [15653.015864] ? pci_pm_suspend+0x2b0/0x2b0
> [15653.015866] ? dpm_run_callback+0x35/0x1f0
> [15653.015868] ? device_resume+0x1ca/0x220
> [15653.015870] ? async_resume+0x19/0xe0
> [15653.015872] ? async_run_entry_fn+0x33/0x120
> [15653.015874] ? process_one_work+0x1d6/0x350
> [15653.015877] ? worker_thread+0x24d/0x480
> [15653.015878] ? kthread+0x137/0x150
> [15653.015880] ? worker_clr_flags+0x40/0x40
> [15653.015882] ? kthread_blkcg+0x30/0x30
> [15653.015884] ? ret_from_fork+0x22/0x30
> [15653.015886] </TASK>
>
> I have not noticed any resulting problems. I am reporting this in the
> hope that it is easy to fix the issue and remove the error messages
> which may obscure some future problem.

The usual way this happens is for a task to be spinning. In this case,
that might be due to excessive lock contention on async_lock within the
async_run_entry_fn() function. Or perhaps the problem is in one of the
functions preceded by "?" above.

One way to debug this is to place trace_printk()s or similar in the
functions called out above to track it down.

I bet that the reason this showed up in v5.19 is this guy:

28b3ae426598 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")

So do you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to some small
number of milliseconds? If so, you can override this by adjusting the
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option or by booting with a
longer timeout via the rcupdate.rcu_exp_cpu_stall_timeout= kernel boot
parameter.
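
A minimal sketch of how one might check which timeout is in effect,
assuming Python 3 is available on the affected machine. The helper names
(kconfig_timeout_ms, runtime_timeout_ms) are made up for illustration,
and the sysfs path is an assumption about where the running kernel
exposes the rcupdate.rcu_exp_cpu_stall_timeout parameter; adjust it if
your kernel differs.

```python
import re
from pathlib import Path
from typing import Optional

# ASSUMPTION: the running kernel exposes the boot parameter at this sysfs
# path. If the file is missing, runtime_timeout_ms() simply returns None.
PARAM_PATH = Path("/sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout")

# Matches only an explicit numeric setting of the option in a kernel .config,
# not "# CONFIG_... is not set" comments or similarly named options.
_OPTION = re.compile(
    r"^CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=(\d+)\s*$", re.MULTILINE
)


def kconfig_timeout_ms(config_text: str) -> Optional[int]:
    """Return the Kconfig value in milliseconds, or None if it is not set."""
    match = _OPTION.search(config_text)
    return int(match.group(1)) if match else None


def runtime_timeout_ms(path: Path = PARAM_PATH) -> Optional[int]:
    """Return the value the running kernel reports, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Passing the text of the kernel's .config to kconfig_timeout_ms() and
calling runtime_timeout_ms() on the running system gives the build-time
and run-time values side by side. If either turns out to be a small
number of milliseconds, that would fit the explanation above, and a
larger rcupdate.rcu_exp_cpu_stall_timeout= value on the kernel command
line would be the thing to try.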

But if you are running on an Android platform, Uladzislau will be
interested in addressing the underlying issue, so I have added him on CC.

Thanx, Paul