[Nouveau] Lockup/panic caused by nouveau_fantog_update recursion
Ilia Mirkin
imirkin at alum.mit.edu
Sat Jan 28 01:02:38 UTC 2017
Ben, perhaps time to apply this? It's been nearly 2 years, and no
better fix has come forward. As it is, we're hanging people's CPUs.
See https://bugs.freedesktop.org/show_bug.cgi?id=91413
On Mon, Mar 30, 2015 at 1:25 AM, 임성택 <stk.lim at samsung.com> wrote:
> Hi,
>
> I used to get a kernel panic from a CPU hard lockup almost once a day.
> I'm running Ubuntu 14.10 with a vanilla Linux 3.19.2 kernel on a Core i7-3770, with Gallium 0.4 on NV108.
> The panic log looked like this:
>
> [ 9227.509744] ------------[ cut here ]------------
> [ 9227.509750] WARNING: CPU: 0 PID: 0 at kernel/watchdog.c:290 watchdog_overflow_callback+0x92/0xc0()
> [ 9227.509751] Watchdog detected hard LOCKUP on cpu 0
> [ 9227.509751] Modules linked in: rfcomm bnep bluetooth intel_rapl iosf_mbi x86_pkg_temp_thermal nouveau intel_powerclamp snd_hda_codec_hdmi coretemp mxm_wmi kvm_intel ttm drm_kms_helper kvm snd_hda_codec_realtek drm snd_hda_codec_generic snd_hda_intel crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel snd_hda_controller aesni_intel snd_hda_codec aes_x86_64 snd_hwdep lrw snd_pcm gf128mul shpchp wmi glue_helper snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq ablk_helper cryptd snd_seq_device snd_timer ppdev lp mei_me parport_pc snd mei parport mac_hid serio_raw soundcore tpm_infineon video lpc_ich hid_generic usbhid hid e1000e ahci ptp libahci pps_core
> [ 9227.509776] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.2-patched+ #1
> [ 9227.509777] Hardware name: SAMSUNG ELECTRONICS CO., LTD. Samsung DeskTop System/SAMSUNG_DT1234567890, BIOS P06KBM.025.130715.SK 07/15/2013
> [ 9227.509778] ffffffff81aad4ca ffff88041dc05ab0 ffffffff817aa343 0000000000000007
> [ 9227.509779] ffff88041dc05b00 ffff88041dc05af0 ffffffff8107349a 0000000000000000
> [ 9227.509780] ffff88040d014800 0000000000000000 ffff88041dc05c40 ffff88041dc05ef8
> [ 9227.509782] Call Trace:
> [ 9227.509783] <NMI> [<ffffffff817aa343>] dump_stack+0x45/0x57
> [ 9227.509788] [<ffffffff8107349a>] warn_slowpath_common+0x8a/0xc0
> [ 9227.509790] [<ffffffff81073516>] warn_slowpath_fmt+0x46/0x50
> [ 9227.509791] [<ffffffff811295b2>] watchdog_overflow_callback+0x92/0xc0
> [ 9227.509794] [<ffffffff8116a79c>] __perf_event_overflow+0x8c/0x2a0
> [ 9227.509797] [<ffffffff8102b9ba>] ? x86_perf_event_set_period+0xca/0x170
> [ 9227.509798] [<ffffffff8116b2b4>] perf_event_overflow+0x14/0x20
> [ 9227.509800] [<ffffffff81032fda>] intel_pmu_handle_irq+0x1ba/0x3a0
> [ 9227.509802] [<ffffffff8102a6db>] perf_event_nmi_handler+0x2b/0x50
> [ 9227.509804] [<ffffffff810184c0>] nmi_handle+0x80/0x120
> [ 9227.509805] [<ffffffff81018a6a>] default_do_nmi+0x4a/0x140
> [ 9227.509806] [<ffffffff81018be8>] do_nmi+0x88/0xd0
> [ 9227.509808] [<ffffffff817b3ae1>] end_repeat_nmi+0x1e/0x2e
> [ 9227.509810] [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509812] [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509813] [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509814] <<EOE>> <IRQ> [<ffffffffc070e564>] nouveau_fantog_update+0x94/0x180 [nouveau]
> [ 9227.509838] [<ffffffffc070e6a5>] nouveau_fantog_set+0x35/0x40 [nouveau]
> [ 9227.509847] [<ffffffffc070d9f1>] nouveau_fan_update+0x101/0x220 [nouveau]
> [ 9227.509855] [<ffffffffc070db69>] nouveau_therm_fan_set+0x19/0x20 [nouveau]
> [ 9227.509863] [<ffffffffc070d1cd>] nouveau_therm_update+0xbd/0x340 [nouveau]
> [ 9227.509871] [<ffffffffc06cdd28>] ? dcb_gpio_match+0x38/0x130 [nouveau]
> [ 9227.509879] [<ffffffffc070d46a>] nouveau_therm_alarm+0x1a/0x20 [nouveau]
> [ 9227.509887] [<ffffffffc0710e90>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [ 9227.509895] [<ffffffffc0710f5b>] nv04_timer_alarm+0x7b/0xe0 [nouveau]
> [ 9227.509903] [<ffffffffc070e62f>] nouveau_fantog_update+0x15f/0x180 [nouveau]
> [ 9227.509910] [<ffffffffc070e66a>] nouveau_fantog_alarm+0x1a/0x20 [nouveau]
> [ 9227.509918] [<ffffffffc0710e90>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [ 9227.509926] [<ffffffffc071102b>] nv04_timer_intr+0x6b/0x90 [nouveau]
> [ 9227.509934] [<ffffffffc070a6bc>] nouveau_mc_intr+0x12c/0x1b0 [nouveau]
> [ 9227.509936] [<ffffffff810cc397>] handle_irq_event_percpu+0x77/0x1a0
> [ 9227.509938] [<ffffffff810cc501>] handle_irq_event+0x41/0x70
> [ 9227.509939] [<ffffffff810cf59e>] handle_edge_irq+0x6e/0x120
> [ 9227.509941] [<ffffffff81016952>] handle_irq+0x22/0x40
> [ 9227.509943] [<ffffffff817b453f>] do_IRQ+0x4f/0xf0
> [ 9227.509945] [<ffffffff817b23ad>] common_interrupt+0x6d/0x6d
> [ 9227.509945] <EOI> [<ffffffff8164cf76>] ? cpuidle_enter_state+0x66/0x160
> [ 9227.509948] [<ffffffff8164cf61>] ? cpuidle_enter_state+0x51/0x160
> [ 9227.509949] [<ffffffff8164d157>] cpuidle_enter+0x17/0x20
> [ 9227.509951] [<ffffffff810b5741>] cpu_startup_entry+0x351/0x400
> [ 9227.509953] [<ffffffff8179f037>] rest_init+0x77/0x80
> [ 9227.509955] [<ffffffff81d4cfce>] start_kernel+0x482/0x48f
> [ 9227.509957] [<ffffffff81d4c120>] ? early_idt_handlers+0x120/0x120
> [ 9227.509958] [<ffffffff81d4c4d7>] x86_64_start_reservations+0x2a/0x2c
> [ 9227.509960] [<ffffffff81d4c61c>] x86_64_start_kernel+0x143/0x152
> [ 9227.509961] ---[ end trace 61afb193cdcd4263 ]---
>
> It looked like there was recursion between 'nouveau_fantog_update' and 'nv04_timer_alarm': the nested call to 'nouveau_fantog_update' tried to take a spinlock that the outer call on the same CPU already held, which caused the lockup.
> I tried a quick fix:
>
> diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c b/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> index f69dab1..eec0cf3 100644
> --- a/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> +++ b/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> @@ -48,7 +48,8 @@ nouveau_fantog_update(struct nouveau_fantog_priv *priv, int percent)
>  	unsigned long flags;
>  	int duty;
>  
> -	spin_lock_irqsave(&priv->lock, flags);
> +	if (WARN_ON(!spin_trylock_irqsave(&priv->lock, flags)))
> +		return;
>  	if (percent < 0)
>  		percent = priv->percent;
>  	priv->percent = percent;
>
> After applying the above patch, my desktop has been running fine for 3 days, with one warning:
>
> [10699.264606] ------------[ cut here ]------------
> [10699.264654] WARNING: CPU: 5 PID: 0 at drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c:51 nouveau_fantog_update+0xc4/0x1c0 [nouveau]()
> [10699.264657] Modules linked in: bnep rfcomm bluetooth snd_hda_codec_hdmi intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm crct10dif_pclmul snd_hda_intel snd_hda_controller snd_hda_codec crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 snd_hwdep lrw gf128mul snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq glue_helper snd_seq_device nouveau ppdev mei_me parport_pc snd_timer lp mxm_wmi ablk_helper ttm drm_kms_helper parport drm snd mei cryptd mac_hid soundcore i2c_algo_bit shpchp tpm_infineon serio_raw wmi lpc_ich video hid_generic usbhid hid e1000e ahci ptp libahci pps_core
> [10699.264714] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.19.2-patched+ #3
> [10699.264717] Hardware name: SAMSUNG ELECTRONICS CO., LTD. Samsung DeskTop System/SAMSUNG_DT1234567890, BIOS P06KBM.025.130715.SK 07/15/2013
> [10699.264720] ffffffffc0512da0 ffff88041dd43b38 ffffffff817aa343 0000000000000007
> [10699.264725] 0000000000000000 ffff88041dd43b78 ffffffff8107349a 0000000000000014
> [10699.264729] 0000000000000086 0000000000000014 ffff8803fd780308 ffff8803fbe0e180
> [10699.264733] Call Trace:
> [10699.264736] <IRQ> [<ffffffff817aa343>] dump_stack+0x45/0x57
> [10699.264751] [<ffffffff8107349a>] warn_slowpath_common+0x8a/0xc0
> [10699.264757] [<ffffffff8107358a>] warn_slowpath_null+0x1a/0x20
> [10699.264787] [<ffffffffc046c594>] nouveau_fantog_update+0xc4/0x1c0 [nouveau]
> [10699.264792] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264821] [<ffffffffc046c6e5>] nouveau_fantog_set+0x35/0x40 [nouveau]
> [10699.264825] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264849] [<ffffffffc046b9f1>] nouveau_fan_update+0x101/0x220 [nouveau]
> [10699.264873] [<ffffffffc046b8f5>] ? nouveau_fan_update+0x5/0x220 [nouveau]
> [10699.264878] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264902] [<ffffffffc046bb69>] nouveau_therm_fan_set+0x19/0x20 [nouveau]
> [10699.264907] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264932] [<ffffffffc046b1cd>] nouveau_therm_update+0xbd/0x340 [nouveau]
> [10699.264954] [<ffffffffc046b455>] ? nouveau_therm_alarm+0x5/0x20 [nouveau]
> [10699.264959] [<ffffffff817b3e28>] ? ftrace_graph_caller+0xa8/0xa8
> [10699.264983] [<ffffffffc046b46a>] nouveau_therm_alarm+0x1a/0x20 [nouveau]
> [10699.264989] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265013] [<ffffffffc046eed0>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [10699.265018] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265043] [<ffffffffc046ef9b>] nv04_timer_alarm+0x7b/0xe0 [nouveau]
> [10699.265048] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265072] [<ffffffffc046c66d>] nouveau_fantog_update+0x19d/0x1c0 [nouveau]
> [10699.265077] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265100] [<ffffffffc046c6aa>] nouveau_fantog_alarm+0x1a/0x20 [nouveau]
> [10699.265105] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265130] [<ffffffffc046eed0>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [10699.265134] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265158] [<ffffffffc046f06b>] nv04_timer_intr+0x6b/0x90 [nouveau]
> [10699.265183] [<ffffffffc046f005>] ? nv04_timer_intr+0x5/0x90 [nouveau]
> [10699.265188] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265212] [<ffffffffc04686bc>] nouveau_mc_intr+0x12c/0x1b0 [nouveau]
> [10699.265218] [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265225] [<ffffffff810cc397>] handle_irq_event_percpu+0x77/0x1a0
> [10699.265229] [<ffffffff810cc501>] handle_irq_event+0x41/0x70
> [10699.265235] [<ffffffff810cf59e>] handle_edge_irq+0x6e/0x120
> [10699.265241] [<ffffffff81016952>] handle_irq+0x22/0x40
> [10699.265246] [<ffffffff817b453f>] do_IRQ+0x4f/0xf0
> [10699.265255] [<ffffffff817b23ad>] common_interrupt+0x6d/0x6d
> [10699.265257] <EOI> [<ffffffff8164cf76>] ? cpuidle_enter_state+0x66/0x160
> [10699.265263] [<ffffffff8164cf61>] ? cpuidle_enter_state+0x51/0x160
> [10699.265266] [<ffffffff8164d157>] cpuidle_enter+0x17/0x20
> [10699.265270] [<ffffffff810b5741>] cpu_startup_entry+0x351/0x400
> [10699.265275] [<ffffffff81048417>] start_secondary+0x197/0x1c0
> [10699.265278] ---[ end trace 25fb71c36f4ba9fa ]---
>
> I think I'm now avoiding the panics, though I don't think the above patch fixes the root cause of the problem.
>
> Thank you.