[Nouveau] Lockup/panic caused by nouveau_fantog_update recursion

Ilia Mirkin imirkin at alum.mit.edu
Sat Jan 28 01:02:38 UTC 2017


Ben, perhaps time to apply this? It's been nearly 2 years, and no
better fix has come forward. As it is, we're hanging people's CPUs.
See https://bugs.freedesktop.org/show_bug.cgi?id=91413

On Mon, Mar 30, 2015 at 1:25 AM, 임성택 <stk.lim at samsung.com> wrote:
> Hi,
>
> I used to get kernel panics caused by a hard lockup on one of the CPUs almost every day.
> I'm running Ubuntu 14.10 with a vanilla Linux 3.19.2 kernel on a Core i7-3770, with Gallium 0.4 on NV108.
> The panic log looked like this:
>
> [ 9227.509744] ------------[ cut here ]------------
> [ 9227.509750] WARNING: CPU: 0 PID: 0 at kernel/watchdog.c:290 watchdog_overflow_callback+0x92/0xc0()
> [ 9227.509751] Watchdog detected hard LOCKUP on cpu 0
> [ 9227.509751] Modules linked in: rfcomm bnep bluetooth intel_rapl iosf_mbi x86_pkg_temp_thermal nouveau intel_powerclamp snd_hda_codec_hdmi coretemp mxm_wmi kvm_intel ttm drm_kms_helper kvm snd_hda_codec_realtek drm snd_hda_codec_generic snd_hda_intel crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel snd_hda_controller aesni_intel snd_hda_codec aes_x86_64 snd_hwdep lrw snd_pcm gf128mul shpchp wmi glue_helper snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq ablk_helper cryptd snd_seq_device snd_timer ppdev lp mei_me parport_pc snd mei parport mac_hid serio_raw soundcore tpm_infineon video lpc_ich hid_generic usbhid hid e1000e ahci ptp libahci pps_core
> [ 9227.509776] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.2-patched+ #1
> [ 9227.509777] Hardware name: SAMSUNG ELECTRONICS CO., LTD. Samsung DeskTop System/SAMSUNG_DT1234567890, BIOS P06KBM.025.130715.SK 07/15/2013
> [ 9227.509778]  ffffffff81aad4ca ffff88041dc05ab0 ffffffff817aa343 0000000000000007
> [ 9227.509779]  ffff88041dc05b00 ffff88041dc05af0 ffffffff8107349a 0000000000000000
> [ 9227.509780]  ffff88040d014800 0000000000000000 ffff88041dc05c40 ffff88041dc05ef8
> [ 9227.509782] Call Trace:
> [ 9227.509783]  <NMI>  [<ffffffff817aa343>] dump_stack+0x45/0x57
> [ 9227.509788]  [<ffffffff8107349a>] warn_slowpath_common+0x8a/0xc0
> [ 9227.509790]  [<ffffffff81073516>] warn_slowpath_fmt+0x46/0x50
> [ 9227.509791]  [<ffffffff811295b2>] watchdog_overflow_callback+0x92/0xc0
> [ 9227.509794]  [<ffffffff8116a79c>] __perf_event_overflow+0x8c/0x2a0
> [ 9227.509797]  [<ffffffff8102b9ba>] ? x86_perf_event_set_period+0xca/0x170
> [ 9227.509798]  [<ffffffff8116b2b4>] perf_event_overflow+0x14/0x20
> [ 9227.509800]  [<ffffffff81032fda>] intel_pmu_handle_irq+0x1ba/0x3a0
> [ 9227.509802]  [<ffffffff8102a6db>] perf_event_nmi_handler+0x2b/0x50
> [ 9227.509804]  [<ffffffff810184c0>] nmi_handle+0x80/0x120
> [ 9227.509805]  [<ffffffff81018a6a>] default_do_nmi+0x4a/0x140
> [ 9227.509806]  [<ffffffff81018be8>] do_nmi+0x88/0xd0
> [ 9227.509808]  [<ffffffff817b3ae1>] end_repeat_nmi+0x1e/0x2e
> [ 9227.509810]  [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509812]  [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509813]  [<ffffffff817b0f1a>] ? _raw_spin_lock_irqsave+0x4a/0x90
> [ 9227.509814]  <<EOE>>  <IRQ>  [<ffffffffc070e564>] nouveau_fantog_update+0x94/0x180 [nouveau]
> [ 9227.509838]  [<ffffffffc070e6a5>] nouveau_fantog_set+0x35/0x40 [nouveau]
> [ 9227.509847]  [<ffffffffc070d9f1>] nouveau_fan_update+0x101/0x220 [nouveau]
> [ 9227.509855]  [<ffffffffc070db69>] nouveau_therm_fan_set+0x19/0x20 [nouveau]
> [ 9227.509863]  [<ffffffffc070d1cd>] nouveau_therm_update+0xbd/0x340 [nouveau]
> [ 9227.509871]  [<ffffffffc06cdd28>] ? dcb_gpio_match+0x38/0x130 [nouveau]
> [ 9227.509879]  [<ffffffffc070d46a>] nouveau_therm_alarm+0x1a/0x20 [nouveau]
> [ 9227.509887]  [<ffffffffc0710e90>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [ 9227.509895]  [<ffffffffc0710f5b>] nv04_timer_alarm+0x7b/0xe0 [nouveau]
> [ 9227.509903]  [<ffffffffc070e62f>] nouveau_fantog_update+0x15f/0x180 [nouveau]
> [ 9227.509910]  [<ffffffffc070e66a>] nouveau_fantog_alarm+0x1a/0x20 [nouveau]
> [ 9227.509918]  [<ffffffffc0710e90>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [ 9227.509926]  [<ffffffffc071102b>] nv04_timer_intr+0x6b/0x90 [nouveau]
> [ 9227.509934]  [<ffffffffc070a6bc>] nouveau_mc_intr+0x12c/0x1b0 [nouveau]
> [ 9227.509936]  [<ffffffff810cc397>] handle_irq_event_percpu+0x77/0x1a0
> [ 9227.509938]  [<ffffffff810cc501>] handle_irq_event+0x41/0x70
> [ 9227.509939]  [<ffffffff810cf59e>] handle_edge_irq+0x6e/0x120
> [ 9227.509941]  [<ffffffff81016952>] handle_irq+0x22/0x40
> [ 9227.509943]  [<ffffffff817b453f>] do_IRQ+0x4f/0xf0
> [ 9227.509945]  [<ffffffff817b23ad>] common_interrupt+0x6d/0x6d
> [ 9227.509945]  <EOI>  [<ffffffff8164cf76>] ? cpuidle_enter_state+0x66/0x160
> [ 9227.509948]  [<ffffffff8164cf61>] ? cpuidle_enter_state+0x51/0x160
> [ 9227.509949]  [<ffffffff8164d157>] cpuidle_enter+0x17/0x20
> [ 9227.509951]  [<ffffffff810b5741>] cpu_startup_entry+0x351/0x400
> [ 9227.509953]  [<ffffffff8179f037>] rest_init+0x77/0x80
> [ 9227.509955]  [<ffffffff81d4cfce>] start_kernel+0x482/0x48f
> [ 9227.509957]  [<ffffffff81d4c120>] ? early_idt_handlers+0x120/0x120
> [ 9227.509958]  [<ffffffff81d4c4d7>] x86_64_start_reservations+0x2a/0x2c
> [ 9227.509960]  [<ffffffff81d4c61c>] x86_64_start_kernel+0x143/0x152
> [ 9227.509961] ---[ end trace 61afb193cdcd4263 ]---
>
> It looked like there was a recursion between 'nouveau_fantog_update' and 'nv04_timer_alarm': the nested call tried to take a spinlock that was already held, and the CPU locked up as a consequence.
> I tried a quick fix:
>
> diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c b/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> index f69dab1..eec0cf3 100644
> --- a/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> +++ b/drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c
> @@ -48,7 +48,8 @@ nouveau_fantog_update(struct nouveau_fantog_priv *priv, int percent)
>         unsigned long flags;
>         int duty;
>
> -       spin_lock_irqsave(&priv->lock, flags);
> +       if (WARN_ON(!spin_trylock_irqsave(&priv->lock, flags)))
> +               return;
>         if (percent < 0)
>                 percent = priv->percent;
>         priv->percent = percent;
>
> After applying the above patch, my desktop has been running fine for 3 days, with one warning:
>
> [10699.264606] ------------[ cut here ]------------
> [10699.264654] WARNING: CPU: 5 PID: 0 at drivers/gpu/drm/nouveau/core/subdev/therm/fantog.c:51 nouveau_fantog_update+0xc4/0x1c0 [nouveau]()
> [10699.264657] Modules linked in: bnep rfcomm bluetooth snd_hda_codec_hdmi intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm crct10dif_pclmul snd_hda_intel snd_hda_controller snd_hda_codec crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 snd_hwdep lrw gf128mul snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq glue_helper snd_seq_device nouveau ppdev mei_me parport_pc snd_timer lp mxm_wmi ablk_helper ttm drm_kms_helper parport drm snd mei cryptd mac_hid soundcore i2c_algo_bit shpchp tpm_infineon serio_raw wmi lpc_ich video hid_generic usbhid hid e1000e ahci ptp libahci pps_core
> [10699.264714] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.19.2-patched+ #3
> [10699.264717] Hardware name: SAMSUNG ELECTRONICS CO., LTD. Samsung DeskTop System/SAMSUNG_DT1234567890, BIOS P06KBM.025.130715.SK 07/15/2013
> [10699.264720]  ffffffffc0512da0 ffff88041dd43b38 ffffffff817aa343 0000000000000007
> [10699.264725]  0000000000000000 ffff88041dd43b78 ffffffff8107349a 0000000000000014
> [10699.264729]  0000000000000086 0000000000000014 ffff8803fd780308 ffff8803fbe0e180
> [10699.264733] Call Trace:
> [10699.264736]  <IRQ>  [<ffffffff817aa343>] dump_stack+0x45/0x57
> [10699.264751]  [<ffffffff8107349a>] warn_slowpath_common+0x8a/0xc0
> [10699.264757]  [<ffffffff8107358a>] warn_slowpath_null+0x1a/0x20
> [10699.264787]  [<ffffffffc046c594>] nouveau_fantog_update+0xc4/0x1c0 [nouveau]
> [10699.264792]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264821]  [<ffffffffc046c6e5>] nouveau_fantog_set+0x35/0x40 [nouveau]
> [10699.264825]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264849]  [<ffffffffc046b9f1>] nouveau_fan_update+0x101/0x220 [nouveau]
> [10699.264873]  [<ffffffffc046b8f5>] ? nouveau_fan_update+0x5/0x220 [nouveau]
> [10699.264878]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264902]  [<ffffffffc046bb69>] nouveau_therm_fan_set+0x19/0x20 [nouveau]
> [10699.264907]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.264932]  [<ffffffffc046b1cd>] nouveau_therm_update+0xbd/0x340 [nouveau]
> [10699.264954]  [<ffffffffc046b455>] ? nouveau_therm_alarm+0x5/0x20 [nouveau]
> [10699.264959]  [<ffffffff817b3e28>] ? ftrace_graph_caller+0xa8/0xa8
> [10699.264983]  [<ffffffffc046b46a>] nouveau_therm_alarm+0x1a/0x20 [nouveau]
> [10699.264989]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265013]  [<ffffffffc046eed0>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [10699.265018]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265043]  [<ffffffffc046ef9b>] nv04_timer_alarm+0x7b/0xe0 [nouveau]
> [10699.265048]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265072]  [<ffffffffc046c66d>] nouveau_fantog_update+0x19d/0x1c0 [nouveau]
> [10699.265077]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265100]  [<ffffffffc046c6aa>] nouveau_fantog_alarm+0x1a/0x20 [nouveau]
> [10699.265105]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265130]  [<ffffffffc046eed0>] nv04_timer_alarm_trigger+0x120/0x170 [nouveau]
> [10699.265134]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265158]  [<ffffffffc046f06b>] nv04_timer_intr+0x6b/0x90 [nouveau]
> [10699.265183]  [<ffffffffc046f005>] ? nv04_timer_intr+0x5/0x90 [nouveau]
> [10699.265188]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265212]  [<ffffffffc04686bc>] nouveau_mc_intr+0x12c/0x1b0 [nouveau]
> [10699.265218]  [<ffffffff817b3e28>] ftrace_graph_caller+0xa8/0xa8
> [10699.265225]  [<ffffffff810cc397>] handle_irq_event_percpu+0x77/0x1a0
> [10699.265229]  [<ffffffff810cc501>] handle_irq_event+0x41/0x70
> [10699.265235]  [<ffffffff810cf59e>] handle_edge_irq+0x6e/0x120
> [10699.265241]  [<ffffffff81016952>] handle_irq+0x22/0x40
> [10699.265246]  [<ffffffff817b453f>] do_IRQ+0x4f/0xf0
> [10699.265255]  [<ffffffff817b23ad>] common_interrupt+0x6d/0x6d
> [10699.265257]  <EOI>  [<ffffffff8164cf76>] ? cpuidle_enter_state+0x66/0x160
> [10699.265263]  [<ffffffff8164cf61>] ? cpuidle_enter_state+0x51/0x160
> [10699.265266]  [<ffffffff8164d157>] cpuidle_enter+0x17/0x20
> [10699.265270]  [<ffffffff810b5741>] cpu_startup_entry+0x351/0x400
> [10699.265275]  [<ffffffff81048417>] start_secondary+0x197/0x1c0
> [10699.265278] ---[ end trace 25fb71c36f4ba9fa ]---
>
> I think I'm now avoiding the panics, though I don't think the above patch fixes the root cause of the problem.
>
> Thank you.
> _______________________________________________
> Nouveau mailing list
> Nouveau at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau
