[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail

Sun Jun 28 10:48:15 UTC 2020

https://bugzilla.kernel.org/show_bug.cgi?id=207383

Duncan (1i5t5.duncan at cox.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.7-rc1, 5.7-rc2, 5.7-rc3   |5.7-rc1 - 5.7 - 5.8-rc1+

--- Comment #31 from Duncan (1i5t5.duncan at cox.net) ---
(In reply to mnrzk from comment #30)
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with a garbage context pointer.

Good! Now we have someone with the bug who can actually read and work with
the code.  That bodes well for a fix.  =:^)

> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared.

I see it always with the workqueue too, but not being a dev I simply assumed
that was how it was; I had no idea it could be taken off the workqueue.

> The system was stable for at least 1.5 hours before I manually
> shut it down (meanwhile it has usually crashed within 30-45 minutes).

You're seeing a crash much faster than I am.  I believe my longest uptime
before a crash with the telltale trace was something like two and a half days,
which has obvious implications for marking a bisect step good: it's always a
gamble that I simply haven't tested long enough.

> Perhaps there's some sort of race condition occurring after commit_work is
> queued?

Agreed, FWIW, though you've taken it farther than I could, since I can't work
with code much beyond bisecting or modifying an existing patch here or there.
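
For anyone following along, here is a toy model of the suspected ordering
problem.  It is only a sketch, not amdgpu or DRM code: every name in it
(toy_state, toy_commit, toy_commit_tail and so on) is hypothetical, and the
"generation" counter is merely a safe stand-in for the context pointer going
bad.  It assumes the race is of the "state gets reused before the queued work
runs" kind, which nobody has confirmed yet.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for driver private state; NOT the real struct dm_atomic_state. */
struct toy_state {
    int generation; /* bumped every time the slot is reused */
    int context;    /* stand-in for the context pointer */
};

/* Toy stand-in for a commit: remembers which state it was built for. */
struct toy_commit {
    struct toy_state *state;
    int expected_generation;
};

/* Reuse the slot for a newer state, as some later activity might. */
static void toy_reuse_state(struct toy_state *s, int new_context)
{
    s->generation++;
    s->context = new_context;
}

/* The "commit tail": returns 1 if the state is still the one the commit
 * was built for, 0 if it has been reused in the meantime (stale). */
static int toy_commit_tail(const struct toy_commit *c)
{
    assert(c != NULL && c->state != NULL);
    return c->state->generation == c->expected_generation;
}

/* Blocking path: the tail runs at once, before anything touches the state. */
static int toy_commit_blocking(struct toy_state *s)
{
    struct toy_commit c = { s, s->generation };
    int ok = toy_commit_tail(&c);

    toy_reuse_state(s, 99); /* later activity happens after the tail */
    return ok;
}

/* Deferred path: the commit is "queued", something else reuses the state,
 * and only then does the worker run the tail. */
static int toy_commit_deferred(struct toy_state *s)
{
    struct toy_commit c = { s, s->generation }; /* queued here */

    toy_reuse_state(s, 99);     /* runs before the worker gets to it */
    return toy_commit_tail(&c); /* worker finally runs the tail */
}
```

With a fresh toy_state, toy_commit_blocking() reports the state as still valid
while toy_commit_deferred() reports it as stale.  That would match the symptom
going away when the commits are run without the workqueue, but it says nothing
about what actually touches the state in the real driver.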
