[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail

bugzilla-daemon at bugzilla.kernel.org
Tue Jul 21 19:32:30 UTC 2020


https://bugzilla.kernel.org/show_bug.cgi?id=207383

--- Comment #75 from Kees Cook (kees at outflux.net) ---
Hi!

First, let me say sorry for all the work my patch has caused! It seems like it
might be tickling another (previously dormant) bug in the gpu driver.


(In reply to mnrzk from comment #30)
> I've been looking at this bug for a while now and I'll try to share what
> I've found about it.
> 
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with a garbage context pointer.
> 
> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared. The system was stable for at least 1.5 hours before I manually
> shut it down (whereas it usually crashed within 30-45 minutes).
> 
> Perhaps there's some sort of race condition occurring after commit_work is
> queued?

If it helps to explain what's happening in 3202fa62f: the kernel memory
allocator now keeps its free pointer in the middle of the object instead of at
offset 0. When memory is freed, the allocator writes 8 bytes into it to link the
newly freed object into its freelist. That write always happened, but after
3202fa62f it lands in the middle of the object rather than at offset 0. If the
work queue is using freed memory, it previously didn't notice the first 8 bytes
being overwritten, and now it does notice the overwrite in the middle... but
that still means something is freeing memory before it should.
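
To make that concrete, here is a toy userspace model in Python (not kernel
code). The 64-byte object size, the offset of the "context" field, and both
pointer values are assumptions picked for illustration, and toy_free() and
stale_read() are hypothetical helpers, not allocator functions. It only models
an 8-byte freelist link being written into a freed object at a chosen offset.

```python
import struct

OBJ_SIZE = 64                      # assumed object size, illustration only
CONTEXT_OFFSET = 32                # assumed offset of the field a stale user reads
CONTEXT_PTR = 0xFFFF888012345000   # made-up "valid" pointer stored in the object
NEXT_FREE = 0xFFFF8880DEADBEE0     # made-up freelist link written on free


def make_object(context_ptr: int) -> bytearray:
    """Build a zeroed object with a pointer-sized field at CONTEXT_OFFSET."""
    obj = bytearray(OBJ_SIZE)
    struct.pack_into("<Q", obj, CONTEXT_OFFSET, context_ptr)
    return obj


def toy_free(obj: bytearray, free_ptr_offset: int) -> None:
    """Model a free: the allocator stores an 8-byte link inside the object."""
    struct.pack_into("<Q", obj, free_ptr_offset, NEXT_FREE)


def stale_read(obj: bytearray) -> int:
    """Model a use-after-free: read the context field from freed memory."""
    return struct.unpack_from("<Q", obj, CONTEXT_OFFSET)[0]


# Link at offset 0: the stale reader still sees the old value.
old_layout = make_object(CONTEXT_PTR)
toy_free(old_layout, 0)
old_value = stale_read(old_layout)

# Link in the middle of the object: the stale reader now sees the link.
new_layout = make_object(CONTEXT_PTR)
toy_free(new_layout, OBJ_SIZE // 2)
new_value = stale_read(new_layout)

print(hex(old_value), hex(new_value))
```

In both runs the object has already been freed when it is read. Only the second
layout makes the stale read return something other than the original pointer,
which would look like a garbage context pointer to the caller. The model only
shows why the symptom changed; the premature free is still the underlying bug.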

Finding that might be a real trick. :( However, if you've suffered through all
those bisections, I wonder if you can try one other thing, which is to compile
the kernel with KASAN:

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_OUTLINE=y
CONFIG_KASAN_STACK=y
CONFIG_KASAN_VMALLOC=y

This will make things _slow_, which might mean the use-after-free race never
triggers. *However*, it's possible that KASAN will catch the bad access as soon
as it happens, without the race needing to play out in the way that produces the
crash you're seeing. (And note that swapping CONFIG_KASAN_OUTLINE=y for
CONFIG_KASAN_INLINE=y might speed things up, but the kernel image gets bigger.)

I'm going to try to read the work queue code for the driver and see if anything
obvious stands out...



More information about the dri-devel mailing list