[Bug 215315] New: [REGRESSION BISECTED] amdgpu crashes system suspend - NUC8i7HVKVA

Thorsten Leemhuis regressions at leemhuis.info
Mon Dec 13 06:04:14 UTC 2021


[TLDR: adding this regression to regzbot; most of this mail is compiled
from a few template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Top-posting for once, to make this easily accessible to everyone.

Thanks for the report.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced f7d6779df642720e22bffd449e683bb8690bd3bf
#regzbot title drm: amdgpu: NUC8i7HVKVA crashes during system suspend
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215315
#regzbot ignore-activity
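
For readers unfamiliar with the command syntax above, here is a minimal
Python sketch of how such lines could be picked out of a mail body. It is
not regzbot's actual parser; the function name and the regular expression
are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not part of regzbot: matches lines such as
# "#regzbot title ..." or "#regzbot link: ..." (the colon is optional).
REGZBOT_LINE = re.compile(r"^#regzbot\s+(\S+?):?(?:\s+(.*))?$")


def parse_regzbot_commands(body):
    """Return a list of (command, argument) pairs found in a mail body."""
    commands = []
    for line in body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            command, argument = match.groups()
            # Commands without an argument (e.g. ignore-activity) get "".
            commands.append((command, argument or ""))
    return commands
```

Fed the four lines above, this yields the pairs "^introduced", "title",
"link" and "ignore-activity", each with the rest of its line as argument.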

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail), then regzbot will automatically
mark the regression as resolved once the fix lands in the appropriate
tree. For more details about regzbot see footer.

Sending this to everyone that got the initial report, to make all aware
of the tracking. I also hope that messages like this motivate people to
directly get at least the regression mailing list and ideally even
regzbot involved when dealing with regressions, as messages like this
wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat).

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them, so unfortunately
I will sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave thus might send someone reading this down the wrong
rabbit hole, which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.


On 13.12.21 00:08, bugzilla-daemon at bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215315
> 
>             Bug ID: 215315
>            Summary: [REGRESSION BISECTED] amdgpu crashes system suspend -
>                     NUC8i7HVKVA
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.15-rc1, 5.15, 5.16-rc4
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Video(DRI - non Intel)
>           Assignee: drivers_video-dri at kernel-bugs.osdl.org
>           Reporter: lenb at kernel.org
>         Regression: No
> 
> My Intel NUC8i7HVKVA has an AMD GPU.
> 
> Until 5.15-rc1, this machine was rock solid in suspend stress testing -- never
> crashing after hundreds of hours of back-to-back suspend cycles.
> 
> Until this patch went upstream:
> 
> commit f7d6779df642720e22bffd449e683bb8690bd3bf (refs/bisect/bad)
> Author: Guchun Chen <guchun.chen at amd.com>
> Date:   Fri Aug 27 18:31:41 2021 +0800
> 
>     drm/amdgpu: stop scheduler when calling hw_fini (v2)
> 
>     This gurantees no more work on the ring can be submitted
>     to hardware in suspend/resume case, otherwise a potential
>     race will occur and the ring will get no chance to stay
>     empty before suspend.
> 
>     v2: Call drm_sched_resubmit_job before drm_sched_start to
>     restart jobs from the pending list.
> 
>     Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>     Suggested-by: Christian König <christian.koenig at amd.com>
>     Signed-off-by: Guchun Chen <guchun.chen at amd.com>
>     Reviewed-by: Christian König <christian.koenig at amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>     Cc: stable at vger.kernel.org
> 
> I bisected that the patch before this one was integrated can handle over 1,000
> back-to-back "freeze" system suspend cycles.  Yet, when this patch is present,
> the system may crash before it completes only 100 cycles, and at most lasts a
> few hundred cycles.
> 
> This crash is present in all following upstream rc's, including 5.15-rc4.
> 
> When I revert this patch from 5.15-rc4, stability returns.
> 
> Usually, the crash is manifest by a black screen, and a system that does not
> respond to ping, and will only respond to a long AC power button press to
> remove power; and a subsequent cold reboot.
> 
> I have witnessed the crash occur, and the "ubuntu color themed" screen enters
> some sort of reverse video mode.  In this weird color mode, I've seen a text
> window oscillate between scrolling and un-scrolling for a line -- sort of like
> it is going back in time, but then changes its mind.  There is no response to
> keyboard, mouse, or network input.
> 
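
The quoted report does not say which tool drove the back-to-back "freeze"
cycles. As a rough sketch only, and assuming util-linux's rtcwake is
available and the script runs as root, a loop of that kind could look like
the following. The function name, the 10-second wake delay and the
injectable runner parameter are all illustrative assumptions.

```python
import subprocess


def run_freeze_cycles(cycles, wake_after_s=10, runner=subprocess.run):
    """Run up to `cycles` suspend-to-idle cycles; return how many completed."""
    completed = 0
    for _ in range(cycles):
        # rtcwake arms an RTC alarm and then enters the requested sleep
        # state; "-m freeze" selects suspend-to-idle.
        result = runner(["rtcwake", "-m", "freeze", "-s", str(wake_after_s)])
        if result.returncode != 0:
            # rtcwake itself failed; stop instead of looping on errors.
            break
        completed += 1
    return completed
```

A hard hang like the one described in the report would never return from
the rtcwake call, so the last cycle count logged before the hang is what
tells you how long the machine survived.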


