[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail

Thu Jul 16 02:12:52 UTC 2020

https://bugzilla.kernel.org/show_bug.cgi?id=207383

--- Comment #65 from Duncan (1i5t5.duncan at cox.net) ---
(In reply to Duncan from comment #63)
> NB: The 3202fa62f followups are cbfc35a48 and 89b83f282.  That should let
> anyone else with git and kernel building skills try reverting the three.
> 
> Still too early (by days) to call it nailed down as I've had it take 2-3
> days to trigger, but no gfx freeze here yet on that v5.8-rc5+ with
> 320-and-followups reverted so far, despite playing 4k video to try to
> trigger it as it has previously on affected kernels.  I'll be trying update
> builds (gentoo) later today or tomorrow, another previous trigger, so we'll
> see how it goes.

I'm still not saying for sure, but that's actually looking like the culprit.

Today's gentoo update included a dep of qtwebengine, which changed ABI so
qtwebengine needed to be rebuilt on top of it, and qtwebengine is chromium-based.
And as anyone that's built chromium (or firefox for that matter) can tell you,
at least on older fx-based hardware, it's several hours of near constant 100%
all-cores.

While rebuilding qtwebengine (at a batch-nice of +19 so it doesn't interfere
too badly with anything else I want to run), I was playing youtube videos at
1080p, not normally a problem by themselves (tho 4k can be, especially 4k60)
but with qtwebengine building at the same time...

No freezes.

I'm going to run with the 320 commit and followups reverted a few more days
before declaring it for sure the culprit, and I'm watching for Anthony's
results as well, but the bug's sure doing a convincing job of hiding ATM if
that commit isn't the culprit!

I'd say it's time to start reviewing the amdgpu code to see what relocating the
slub freelist pointer to the middle of the object (what the 320 commit did,
according to its git log explanation) could tickle when the work goes on the
work queue to run later.  That's consistently the scenario the logs show, and
it's what mnrzk confirmed in comment #30 by forcing it /not/ to go to the work
queue.

Hopefully we can still get and confirm a proper codefix by 5.8.0 release. =:^)
