[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail
bugzilla-daemon at bugzilla.kernel.org
bugzilla-daemon at bugzilla.kernel.org
Tue Jul 21 20:49:36 UTC 2020
https://bugzilla.kernel.org/show_bug.cgi?id=207383
--- Comment #77 from Kees Cook (kees at outflux.net) ---
(Midair collision... you saw the same about the structure layout as I did.
Here's my comment...)
(In reply to mnrzk from comment #30)
> I've been looking at this bug for a while now and I'll try to share what
> I've found about it.
>
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with an garbage context pointer.
It looks like when amdgpu_dm_atomic_commit_tail() walks the private objects
list with for_each_new_private_obj_in_state(), it'll return the first object's
state when the function pointer tables match. This is a struct dm_atomic_state
allocation, which is 16 bytes:
struct drm_private_state {
struct drm_atomic_state *state;
};
struct dm_atomic_state {
struct drm_private_state base;
struct dc_state *context;
};
If struct dm_atomic_state is being freed early, this would match the behavior
seen: before 3202fa62f, .base.state would be overwritten with a freelist
pointer. After 3202fa62f, .context will be overwritten.
In looking for all "kfree(.*state" patterns in
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c, I see a few suspicious
things, maybe. dm_crtc_destroy_state() and amdgpu_dm_connector_funcs_reset() do
an explicit kfree(state) -- should they use dm_atomic_destroy_state() instead?
Or nothing at all, since I'd expect "state" to be managed by the drm layer via
the .atomic_destroy_state callback?
> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared. The system was stable for at least 1.5 hours before I manually
> shut it down (meanwhile it has usually crashed within 30-45 minutes).
Is this the async call to "commit_work" in drm_atomic_helper_commit()?
There's a big warning in there:
/*
* Everything below can be run asynchronously without the need to grab
* any modeset locks at all under one condition: It must be guaranteed
* that the asynchronous work has either been cancelled (if the driver
* supports it, which at least requires that the framebuffers get
* cleaned up with drm_atomic_helper_cleanup_planes()) or completed
* before the new state gets committed on the software side with
* drm_atomic_helper_swap_state().
...
I'm not sure how to determine if amdgpu_dm.c is doing this correctly?
I can't tell what can interfere with drm_atomic_helper_commit() -- I would
guess the race is between that and something else causing a kfree(), but I
don't know the APIs here at all...
--
You are receiving this mail because:
You are watching the assignee of the bug.
More information about the dri-devel
mailing list