[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail

Tue Jul 21 20:33:54 UTC 2020

https://bugzilla.kernel.org/show_bug.cgi?id=207383

--- Comment #76 from mnrzk at protonmail.com ---
(In reply to Kees Cook from comment #75)
> Hi!
> 
> First, let me say sorry for all the work my patch has caused! It seems like
> it might be tickling another (previously dormant) bug in the gpu driver.
> 
> 
> (In reply to mnrzk from comment #30)
> > I've been looking at this bug for a while now and I'll try to share what
> > I've found about it.
> > 
> > In some conditions, when amdgpu_dm_atomic_commit_tail calls
> > dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> > dm_atomic_state* with a garbage context pointer.
> > 
> > I've also found that this bug exclusively occurs when commit_work is on the
> > workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> > without adding to the workqueue and running the OS, the issue seems to have
> > disappeared. The system was stable for at least 1.5 hours before I manually
> > shut it down (meanwhile it has usually crashed within 30-45 minutes).
> > 
> > Perhaps there's some sort of race condition occurring after commit_work is
> > queued?
> 
> If it helps to explain what's happening in 3202fa62f, the kernel memory
> allocator is moving its free pointer from offset 0 to the middle of the
> object. That means that when the memory is freed, it writes 8 bytes to join
> the newly freed memory into the allocator's freelist. That always happened,
> but after 3202fa62f it began writing it in the middle, not offset 0. If the
> work queue is trying to use freed memory, and before it didn't notice the
> first 8 bytes getting written, now it appears to notice the overwrite... but
> that still means something is freeing memory before it should.
> 
> Finding that might be a real trick. :( However, if you've suffered through
> all those bisections, I wonder if you can try one other thing, which is to
> compile the kernel with KASAN:
> 
> CONFIG_KASAN=y
> CONFIG_KASAN_GENERIC=y
> CONFIG_KASAN_OUTLINE=y
> CONFIG_KASAN_STACK=y
> CONFIG_KASAN_VMALLOC=y
> 
> This will make things _slow_, which might mean the use-after-free race may
> never trigger. *However* it's possible that it'll catch a bad behavior
> before it even needs to get hit in a race that triggers the behavior you're
> seeing. (And note that swapping CONFIG_KASAN_OUTLINE=y for
> CONFIG_KASAN_INLINE=y might speed things up, but the kernel image gets
> bigger).
> 
> I'm going to try to read the work queue code for the driver and see if
> anything obvious stands out...

Actually, this makes perfect sense. struct dm_atomic_state has two
members: base (a struct containing a struct drm_atomic_state*) and
context (a struct dc_state*). Reading through the code of
amdgpu_dm_atomic_commit_tail, I see that dm_state->base is never used.

If my understanding is correct, base would previously have been overwritten
with the freelist pointer (since it occupies the first 8 bytes). Now that
the freelist pointer is placed in the middle of the object (rounded to a
multiple of sizeof(void*), i.e. 8 bytes), it lands in the last 8 bytes of
*dm_state, which is dm_state->context.
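
To make the layout argument concrete, here is a rough sketch. The structs
are simplified stand-ins (not the real kernel definitions), all names
ending in _sketch are made up for illustration, and the offset formula is
only my assumption of how "middle of the object, pointer aligned" works
out; it is not copied from the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, NOT the real kernel definitions. */
struct drm_atomic_state;	/* opaque for this sketch */
struct dc_state;		/* opaque for this sketch */

struct private_state_sketch {
	struct drm_atomic_state *state;	/* a single pointer */
};

struct dm_atomic_state_sketch {
	struct private_state_sketch base;	/* offset 0 */
	struct dc_state *context;		/* offset sizeof(void *) */
};

/*
 * Assumed model of the relocated free pointer: half the object size,
 * rounded up to a multiple of sizeof(void *).
 */
size_t freeptr_offset_sketch(size_t object_size)
{
	size_t align = sizeof(void *);

	return (object_size / 2 + align - 1) / align * align;
}

/* Returns 1 if the modelled free pointer lands exactly on 'context'. */
int freeptr_hits_context_sketch(void)
{
	size_t off =
		freeptr_offset_sketch(sizeof(struct dm_atomic_state_sketch));

	return off == offsetof(struct dm_atomic_state_sketch, context);
}
```

Under these assumptions, a two-pointer object has its modelled free
pointer on the second pointer, which is where context sits. With the
old placement at offset 0, it would have overwritten base instead.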

I'll place a void* for padding in the middle of struct dm_atomic_state. If
my hypothesis is correct, the padding will be overwritten with garbage
instead of context, and the crash should stop. Of course, there would still
be a use-after-free bug in the code, which may cause other issues in the
future, so I wouldn't really consider that a solution.

Regarding KASAN, I've tried compiling the kernel with KASAN enabled, and
the bug did not trigger after 3 hours of active use followed by 12 hours of
leaving the system on. This was almost a month ago, though, so maybe I'll
try again with different KASAN options (e.g. CONFIG_KASAN_INLINE=y). If
anyone has any more tips on getting KASAN to run faster, I'll be glad to
hear them.
