(REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots

Michal Kubecek mkubecek at suse.cz
Thu Jun 2 14:22:54 UTC 2022


On Thu, Jun 02, 2022 at 09:58:22AM -0400, Alex Deucher wrote:
> On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek at suse.cz> wrote:
> > On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > > Hello,
> > >
> > > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > > 7e284070abe5), I keep getting errors like below. I have not seen them
> > > with 5.18 final or older.
> > >
> > > ------------------------------------------------------------------------
> > > [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > > [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > > [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
> > > [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > > [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> > [...]
> > > [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
> > > [  250.434986] [drm] Fence fallback timer expired on ring sdma0
> > > [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> > [...]
> > > ------------------------------------------------------------------------
> > >
> > > There does not seem to be any apparent immediate problem with graphics
> > > but when running commit babf0bb978e3, there seemed to be a noticeable
> > > lag in some operations, e.g. when moving a window or repainting large
> > > part of the terminal window in konsole (no idea if it's related).
> > >
> > > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > > I collect to help debugging the issue?
> >
> > Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> > There seem to be later commits depending on it so I did not test
> > a revert on top of current mainline.
> >
> > I should also mention that most commits tested as "bad" during the
> > bisect did behave much worse than current mainline (errors starting as
> > early as with sddm, visibly damaged screen content, sometimes even
> > crashes). But all of them issued messages similar to those above into
> > kernel log.
> 
> Can you verify that the kernel you tested has this patch:
> https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d

Yes, both of them:

mike at lion:~/work/git/kernel-upstream> git merge-base --is-ancestor 5be323562c6a babf0bb978e3 && echo yes
yes

(7e284070abe5 is a later mainline snapshot so it also contains
5be323562c6a)
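
The same check can be repeated for several snapshots at once. What
follows is only a minimal sketch: the function name contains_fix is
made up for illustration, and it assumes it is run inside a kernel git
tree where all the commit IDs are known. It relies on the documented
exit status of "git merge-base --is-ancestor" (0 for ancestor, 1 for
not an ancestor, anything else for an error).

```shell
# Hypothetical helper (name is illustrative, not part of git):
# report for each snapshot whether it contains the given fix commit.
contains_fix() {
    fix=$1
    shift
    for snap in "$@"; do
        rc=0
        # Exit status: 0 = ancestor, 1 = not an ancestor, other = error
        git merge-base --is-ancestor "$fix" "$snap" || rc=$?
        case $rc in
            0) echo "$snap: yes" ;;
            1) echo "$snap: no" ;;
            *) echo "$snap: error" ;;
        esac
    done
}

# Example call with the commits mentioned in this thread:
#   contains_fix 5be323562c6a babf0bb978e3 7e284070abe5
```

Treating "error" separately avoids reading an unknown commit ID as
"fix not present".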

But it's likely that commit 5be323562c6a fixed most of the problem and
only some corner case was left: most bisect steps produced many more
error messages and some even crashed before I was able to log into KDE.
Compared to that, the mainline snapshots show far fewer errors, no
distorted picture and no crash; on the other hand, applications like
firefox or stellarium seem to trigger the errors quite consistently.
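
To put rough numbers on which applications trigger the faults, the
"GPU fault detected" lines could be tallied per process. This is only
a sketch: the function name count_faults_by_process is made up, it
assumes the message format shown in the log excerpt above, and it
assumes process names without spaces.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<count> <process>" for each process named in a GPU fault message.
count_faults_by_process() {
    # Extract the word following "for process" from matching lines,
    # then count identical names, most frequent first.
    sed -n 's/.*GPU fault detected: .* for process \([^ ]*\) pid .*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example call (reading the kernel log may need root):
#   dmesg | count_faults_by_process
```

As the log is rate limited ("callbacks suppressed"), the counts would
only be a lower bound.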

Michal