[Regression] CPU stalls and eventually causes a complete system freeze with 6.0.3 due to "video/aperture: Disable and unregister sysfb devices via aperture helpers"

Sun Oct 23 08:04:20 UTC 2022

Hi, this is your Linux kernel regression tracker speaking.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developer don't keep an eye on it, I decided to forward it by
mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216616  :

>  Andreas 2022-10-22 14:25:32 UTC
> 
> Created attachment 303074 [details]
> dmesg
> 
> 6.0.2 works.
> 
> On 6.0.3 the system is very sluggish with graphic glitches all over the place in KDE Plasma Desktop X11 (no graphic glitches when using Wayland, but also sluggish). SDDM works fine.
> 
> Hardware: Lenovo Legion 5 Pro 16ACH6H: AMD Ryzen 7 5800H "Cezanne", hybrid graphics AMD "Green Sardine" (Vega 8 GCN 5.1, AMDGPU) and Nvidia GeForce RTX 3070 Mobile (GA104M, not working with nouveau, I'm not using the proprietary nvidia driver).
> 
> [reply] [−] Comment 1 Andreas 2022-10-22 14:27:15 UTC
> 
> Created attachment 303075 [details]
> my kernel .config for 6.0.3
> 
> Only was CONFIG_HID_TOPRE added in 6.0.3, otherwise it is identical as my .config for 6.0.2.
> 
> [reply] [−] Comment 2 Andreas 2022-10-22 14:51:23 UTC
> 
> In /var/log/Xorg.0.log the only obvious difference is the last line:
> ---- snap
> randr: falling back to unsynchronized pixmap sharing
> ---- snap
> The line is present when I boot with 6.0.3, but isn't when I boot 6.0.2.
> 
> (Obviously this is when I login to KDE with X11, not with Wayland, from SDDM.)
> 
> [reply] [−] Comment 3 Andreas 2022-10-22 22:10:19 UTC
> 
> I did a git bisect on stable kernels 5.0.3 as bad and 5.0.2 as good, this is the result:
> 
> cfecfc98a78d97a49807531b5b224459bda877de is the first bad commit
> commit cfecfc98a78d97a49807531b5b224459bda877de (HEAD, refs/bisect/bad)
> Author: Thomas Zimmermann <tzimmermann at suse.de>
> Date:   Mon Jul 18 09:23:18 2022 +0200
> 
>     video/aperture: Disable and unregister sysfb devices via aperture helpers
>     
>     [ Upstream commit 5e01376124309b4dbd30d413f43c0d9c2f60edea ]
>     
>     Call sysfb_disable() before removing conflicting devices in aperture
>     helpers. Fixes sysfb state if fbdev has been disabled.
>     
>     Signed-off-by: Thomas Zimmermann <tzimmermann at suse.de>
>     Reviewed-by: Javier Martinez Canillas <javierm at redhat.com>
>     Fixes: fb84efa28a48 ("drm/aperture: Run fbdev removal before internal helpers")
> 
> [reply] [−] Comment 4 Andreas 2022-10-22 22:11:51 UTC
> 
> Link to the suspect patch:
> 
> https://patchwork.freedesktop.org/patch/msgid/20220718072322.8927-8-tzimmermann@suse.de
> (or https://patchwork.freedesktop.org/patch/494608/)
> 
> [reply] [−] Comment 5 Andreas 2022-10-22 22:38:14 UTC
> 
> Okay, so I reverted v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch on stable 5.0.3 and the fault is gone.
> 
> I always logged out immediately, which worked (even though everything is very very sluggish). Also, when I killed the X session within a couple of seconds (15 or so), no error was shown (I used "systemctl stop sddm" from another virtual console).
> 
> Noteworthy: I once compiled a kernel from within the Plasma Desktop, while it was sluggish. The kernel compiled alright. When it was finished I moved the mouse to reboot, at which point it completely froze and I had to hard-reset the system.
> 
> While still running, after > 15 seconds, the fault looked like this (dmesg):
> ---- snap ----
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
> rcu: blocking rcu_node structures (internal RCU debug):
> Task dump for CPU 13:
> task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x00000008
> Call Trace:
>  <TASK>
>  ? commit_tail+0xd7/0x130
>  ? drm_atomic_helper_commit+0x126/0x150
>  ? drm_atomic_commit+0xa4/0xe0
>  ? drm_plane_get_damage_clips.cold+0x1c/0x1c
>  ? drm_atomic_helper_dirtyfb+0x19e/0x280
>  ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? drm_ioctl_kernel+0xc4/0x150
>  ? drm_ioctl+0x246/0x3f0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? __x64_sys_ioctl+0x91/0xd0
>  ? do_syscall_64+0x60/0xd0
>  ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>  </TASK>
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 29 jiffies s: 165 root: 0x2000/.
> rcu: blocking rcu_node structures (internal RCU debug):
> Task dump for CPU 13:
> task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x00000008
> Call Trace:
>  <TASK>
>  ? commit_tail+0xd7/0x130
>  ? drm_atomic_helper_commit+0x126/0x150
>  ? drm_atomic_commit+0xa4/0xe0
>  ? drm_plane_get_damage_clips.cold+0x1c/0x1c
>  ? drm_atomic_helper_dirtyfb+0x19e/0x280
>  ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? drm_ioctl_kernel+0xc4/0x150
>  ? drm_ioctl+0x246/0x3f0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? __x64_sys_ioctl+0x91/0xd0
>  ? do_syscall_64+0x60/0xd0
>  ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>  </TASK>
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 8 jiffies s: 169 root: 0x2000/.
> rcu: blocking rcu_node structures (internal RCU debug):
> Task dump for CPU 13:
> task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
> Call Trace:
>  <TASK>
>  ? memcpy_toio+0x76/0xc0
>  ? drm_fb_memcpy_toio+0x76/0xb0
>  ? drm_fb_blit_toio+0x75/0x2b0
>  ? simpledrm_simple_display_pipe_update+0x132/0x150
>  ? drm_atomic_helper_commit_planes+0xb6/0x230
>  ? drm_atomic_helper_commit_tail+0x44/0x80
>  ? commit_tail+0xd7/0x130
>  ? drm_atomic_helper_commit+0x126/0x150
>  ? drm_atomic_commit+0xa4/0xe0
>  ? drm_plane_get_damage_clips.cold+0x1c/0x1c
>  ? drm_atomic_helper_dirtyfb+0x19e/0x280
>  ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? drm_ioctl_kernel+0xc4/0x150
>  ? drm_ioctl+0x246/0x3f0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? __x64_sys_ioctl+0x91/0xd0
>  ? do_syscall_64+0x60/0xd0
>  ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>  </TASK>
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 30 jiffies s: 169 root: 0x2000/.
> rcu: blocking rcu_node structures (internal RCU debug):
> Task dump for CPU 13:
> task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
> Call Trace:
>  <TASK>
>  ? memcpy_toio+0x76/0xc0
>  ? memcpy_toio+0x1b/0xc0
>  ? drm_fb_memcpy_toio+0x76/0xb0
>  ? drm_fb_blit_toio+0x75/0x2b0
>  ? simpledrm_simple_display_pipe_update+0x132/0x150
>  ? drm_atomic_helper_commit_planes+0xb6/0x230
>  ? drm_atomic_helper_commit_tail+0x44/0x80
>  ? commit_tail+0xd7/0x130
>  ? drm_atomic_helper_commit+0x126/0x150
>  ? drm_atomic_commit+0xa4/0xe0
>  ? drm_plane_get_damage_clips.cold+0x1c/0x1c
>  ? drm_atomic_helper_dirtyfb+0x19e/0x280
>  ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? drm_ioctl_kernel+0xc4/0x150
>  ? drm_ioctl+0x246/0x3f0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? __x64_sys_ioctl+0x91/0xd0
>  ? do_syscall_64+0x60/0xd0
>  ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>  </TASK>
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 52 jiffies s: 169 root: 0x2000/.
> rcu: blocking rcu_node structures (internal RCU debug):
> Task dump for CPU 13:
> task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
> Call Trace:
>  <TASK>
>  ? memcpy_toio+0x76/0xc0
>  ? memcpy_toio+0x1b/0xc0
>  ? drm_fb_memcpy_toio+0x76/0xb0
>  ? drm_fb_blit_toio+0x75/0x2b0
>  ? simpledrm_simple_display_pipe_update+0x132/0x150
>  ? drm_atomic_helper_commit_planes+0xb6/0x230
>  ? drm_atomic_helper_commit_tail+0x44/0x80
>  ? commit_tail+0xd7/0x130
>  ? drm_atomic_helper_commit+0x126/0x150
>  ? drm_atomic_commit+0xa4/0xe0
>  ? drm_plane_get_damage_clips.cold+0x1c/0x1c
>  ? drm_atomic_helper_dirtyfb+0x19e/0x280
>  ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? drm_ioctl_kernel+0xc4/0x150
>  ? drm_ioctl+0x246/0x3f0
>  ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
>  ? __x64_sys_ioctl+0x91/0xd0
>  ? do_syscall_64+0x60/0xd0
>  ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>  </TASK>
> traps: avahi-ml[4447] general protection fault ip:7fdde6a37bc1 sp:7fdde07fc920 error:0 in module-zeroconf-publish.so[7fdde6a37000+3000]
> 

See the ticket for more details.

BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it's doesn't fall through the cracks:

#regzbot introduced: cfecfc98a78d9
https://bugzilla.kernel.org/show_bug.cgi?id=216616
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.