[Bug 212739] New: [amdgpu] Sporadic GPU errors, screen artifacts and GPU-induced system lockups on Vega 10 (Raven Ridge)

Wed Apr 21 06:49:18 UTC 2021

https://bugzilla.kernel.org/show_bug.cgi?id=212739

            Bug ID: 212739
           Summary: [amdgpu] Sporadic GPU errors, screen artifacts and
                    GPU-induced system lockups on Vega 10 (Raven Ridge)
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.11.14-1, 5.12.rc7.d0411.gd434405-1
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri at kernel-bugs.osdl.org
          Reporter: tunas at cryptolab.net
        Regression: No

Created attachment 296449
  --> https://bugzilla.kernel.org/attachment.cgi?id=296449&action=edit
Example of GPU artifacts from the recoverable variant of this error

From time to time, the amdgpu driver will report a page fault (sometimes coming
from pid 0, sometimes coming from the web browser, sometimes the screen
compositor or Xorg, sometimes a video player, etc.) as shown below:

>kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0
>ring:0 vmid:4 pasid:0, for process  pid 0 thread  pid 0)
>kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address
>0x800101606000 from client 27
>kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
>kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP
>(0x8)
>kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
>kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
>kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
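
The decoded fields in the log above can be cross-checked against the raw
VM_L2_PROTECTION_FAULT_STATUS word. The following is a minimal sketch, not the
driver's own code: the bit offsets and widths in FIELDS are an assumption,
chosen because they reproduce the values printed in this log (including
vmid:4 from the first line), and the names FIELDS and decode_fault_status are
hypothetical.

```python
# Assumed (bit offset, width) layout of VM_L2_PROTECTION_FAULT_STATUS.
# This layout is an assumption; it was only checked against the values
# printed in the log excerpt of this report.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID (0x8 is printed as TCP in the log)
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status word taken from the log excerpt above.
    for name, value in decode_fault_status(0x00401031).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, 0x00401031 yields the same MORE_FAULTS,
WALKER_ERROR, PERMISSION_FAULTS, MAPPING_ERROR and RW values that the driver
printed, which suggests the log lines are a straightforward field-by-field
decode of the register.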

This message is repeated several thousand times in dmesg ("x callbacks
suppressed") with different addresses of the form 0x80010160Y000 (where Y is a
hex digit from 1 to 8).
In the meantime, the display is completely hung: inputs go through and music
keeps playing, but the screen is static.
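
To check how many distinct page addresses are involved, the fault lines can be
tallied from a saved log. This is a rough sketch that assumes each fault
message sits on a single line in the saved log (the excerpts in this report
are wrapped by the mail client); the function names are hypothetical.

```python
import re
from collections import Counter

# Address in "... in page starting at address 0x... from client N".
ADDR_RE = re.compile(r"in page starting at address\s+(0x[0-9a-fA-F]+)")
# Address pattern described in this report: 0x80010160Y000, Y in 1..8.
REPORTED_RE = re.compile(r"0x80010160[1-8]000")


def fault_addresses(log_text: str) -> Counter:
    """Count how often each faulting page address appears in the log text."""
    return Counter(m.group(1).lower() for m in ADDR_RE.finditer(log_text))


def matches_reported_pattern(addr: str) -> bool:
    """True if addr has the form 0x80010160Y000 with Y between 1 and 8."""
    return REPORTED_RE.fullmatch(addr.lower()) is not None


if __name__ == "__main__":
    import sys

    # Usage (file name is an example): dmesg > faults.log
    #                                  python3 count_faults.py faults.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        counts = fault_addresses(fh.read())
    for addr, n in counts.most_common():
        flag = "" if matches_reported_pattern(addr) else "  (outside reported range)"
        print(f"{addr}: {n}{flag}")
```

Because the kernel rate-limits these messages ("callbacks suppressed"), the
counts are a lower bound on the real number of faults.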

Then, several seconds later, it's followed by:
>kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences
>timed out!

And finally,

>[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft
>recovered

After this, the computer resumes operation, but with GPU artifacts left on the
screen (see the attached screenshot for an example).

Alternatively, instead of the soft recovery message, the GPU sometimes fails to
recover and the following messages appear in the kernel log:

>kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access
>in command stream
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
>seq=3356413, emitted seq=3356415
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
>process Xorg pid 14524 thread Xorg:cs0 pid 14539
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>kernel: [drm] free PSP TMR buffer
>kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
>kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
>kernel: [drm] PSP is resuming...
>kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
>kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not
>available
>kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not
>available
>kernel: [drm] kiq ring mec 2 pipe 1 q 0
>kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR*
>ring sdma0 test failed (-110)
>kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP
>block <sdma_v4_0> failed -110
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110

at which point rebooting is necessary as the GPU will not resume operation.
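
The two outcomes described in this report (soft recovery versus a failed GPU
reset) can be told apart from a saved log excerpt. A minimal sketch follows; it
matches only the message strings quoted in this report, the function name and
return labels are hypothetical, and other kernel versions may word these
messages differently.

```python
import re

# Message fragments copied from the log excerpts in this report.
RESET_FAILED_RE = re.compile(r"GPU reset\(\d+\) failed")
SOFT_RECOVERED_RE = re.compile(r"ring gfx timeout, but soft\s+recovered")


def classify_incident(log_text: str) -> str:
    """Return 'reset-failed', 'soft-recovered', or 'unknown' for a log excerpt."""
    # Check the failure first: a failed reset is the more severe outcome.
    if RESET_FAILED_RE.search(log_text):
        return "reset-failed"
    if SOFT_RECOVERED_RE.search(log_text):
        return "soft-recovered"
    return "unknown"
```

For reference, the -110 in the failed-reset messages is the Linux errno value
for ETIMEDOUT, i.e. the sdma0 ring test timed out after the reset.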

This also happens on the latest 5.12 rc (as of the writing of this bug report,
this is rc7).
