[Bug 101325] UE4Editor crash after pressing "play" with radeon southern island card (7850 HD)

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Wed Jun 7 01:19:18 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=101325

            Bug ID: 101325
           Summary: UE4Editor crash after pressing "play" with radeon
                    southern island card (7850 HD)
           Product: Mesa
           Version: git
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Drivers/Gallium/radeonsi
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: dodgyville+freedesktop at gmail.com
        QA Contact: dri-devel at lists.freedesktop.org

Created attachment 131758
  --> https://bugs.freedesktop.org/attachment.cgi?id=131758&action=edit
~/ddebug_dumps/  file with environment variables R600_DEBUG=check_vm
GALLIUM_DDEBUG="pipelined 10000"

Hi, thanks for your great work on bringing hardware-accelerated graphics to
Linux.

I have a recurring problem with one 3D program (UE4Editor) crashing my computer
during a particular operation (pressing "play" on any project, including a
blank one).

I believe the problem is at the Mesa layer.

Brief details:
Radeon HD 7850
It crashes using radeon and amdgpu (on Ubuntu 17.04, Linux 4.11, padoka PPA).
It does not crash using the fglrx driver (on Ubuntu 14.04).
The entire machine hangs and requires a reset.

Following a suggestion from Michael, I set the following environment variables
when running UE4Editor:

R600_DEBUG=check_vm GALLIUM_DDEBUG="pipelined 10000"
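
For anyone reproducing this, here is a minimal Python sketch of one way to apply the two variables to a single process rather than to the whole session. The helper names debug_env and launch are made up for illustration and are not part of Mesa or UE4; the command to run is whatever path the editor binary has on your machine.

```python
import os
import subprocess


def debug_env(base=None):
    """Return a copy of the environment with the two debug variables set."""
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = "check_vm"
    env["GALLIUM_DDEBUG"] = "pipelined 10000"
    return env


def launch(cmd):
    """Run cmd (a list of strings) with the debug environment; return its exit code."""
    return subprocess.run(cmd, env=debug_env()).returncode


# Usage (the path is a placeholder, not taken from the report):
#   launch(["/path/to/UE4Editor"])
```

Copying the environment first keeps the rest of the desktop session unaffected by the debug settings.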

With these set, the program produces a dump without taking down the whole
system. It created a file in ~/ddebug_dumps/ with more information about the
GPU hang (attached).


I have also attached dmesg logs from two separate trials, one with the radeon
module and one with amdgpu. For these I blacklisted both modules in grub2 and
rebooted.

I first loaded amdgpu manually and triggered the crash in UE4Editor, then
rebooted and repeated the test with the radeon module loaded manually.

The radeon log seems to give more information, and with radeon the screen
flashed a few times before freezing. With amdgpu the X session just hangs and
does nothing.

I made sure I was running the latest (git 17-06-05) padoka builds.

After the X session had completely frozen (mouse included), I was still able to
ssh into the machine for a few minutes before that connection dropped as well.

This is the part of dmesg running radeon where it appears to go off the rails:

radeon 0000:07:00.0: ring 4 stalled for more than 10024msec
[  +0.000004] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000000fd0 last fence id 0x0000000000000fd2 on ring 4)
[  +0.485614] radeon 0000:07:00.0: Saved 724 dwords of commands on ring 0.
[  +0.000126] radeon 0000:07:00.0: GPU softreset: 0x0000004D
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS               = 0xA0403028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x08000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x08000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000118] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000002] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400006
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x84038647
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83106
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000002] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  +0.465637] radeon 0000:07:00.0: GRBM_SOFT_RESET=0x0000DDFF
[  +0.000052] radeon 0000:07:00.0: SRBM_SOFT_RESET=0x00100100
[  +0.001146] radeon 0000:07:00.0:   GRBM_STATUS               = 0x00003028
[  +0.000002] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000000] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000111] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000247] radeon 0000:07:00.0: GPU reset succeeded, trying to resume
[  +0.025448] [drm] probing gen 2 caps for device 8086:151 = 261a103/e
[  +0.000003] [drm] PCIE gen 3 link speeds already enabled
[  +0.002586] [drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
[  +0.000120] radeon 0000:07:00.0: WB enabled
[  +0.000002] radeon 0000:07:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff94260bdd8c00
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff94260bdd8c04
[  +0.000000] radeon 0000:07:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff94260bdd8c08
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff94260bdd8c0c
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff94260bdd8c10
[  +0.000314] radeon 0000:07:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffaa3d89635a18
[  +0.010136] radeon 0000:07:00.0: failed VCE resume (-22).
[  +0.159454] [drm] ring test on 0 succeeded in 4 usecs
[  +0.000004] [drm] ring test on 1 succeeded in 1 usecs
[  +0.000003] [drm] ring test on 2 succeeded in 1 usecs
[  +0.000009] [drm] ring test on 3 succeeded in 6 usecs
[  +0.000007] [drm] ring test on 4 succeeded in 5 usecs
[  +0.175707] [drm] ring test on 5 succeeded in 2 usecs
[  +0.000004] [drm] UVD initialized successfully.
[  +1.041140] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[  +0.000018] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
[  +0.474934] radeon 0000:07:00.0: GPU softreset: 0x00000048
[  +0.000002] radeon 0000:07:00.0:   GRBM_STATUS               = 0xA0003028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000118] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
[  +0.000002] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x84010243
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000002] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  +0.465304] radeon 0000:07:00.0: GRBM_SOFT_RESET=0x0000DDFF
[  +0.000056] radeon 0000:07:00.0: SRBM_SOFT_RESET=0x00000100
[  +0.001147] radeon 0000:07:00.0:   GRBM_STATUS               = 0x00003028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000110] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000238] radeon 0000:07:00.0: GPU reset succeeded, trying to resume
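
The numbers in parentheses in the log above (-22 and -110), and the -12 that shows up in the next excerpt, are negated Linux errno values. Below is a small Python sketch for decoding them. The table is hardcoded with the Linux values for just these three codes so that it does not depend on the platform running the script; the function name decode_status is only illustrative.

```python
import re

# Linux errno values for the three codes that appear in this log.
LINUX_ERRNO = {
    12: ("ENOMEM", "Out of memory"),
    22: ("EINVAL", "Invalid argument"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# A negative code in parentheses at the end of a line, e.g. "(-110)."
CODE_RE = re.compile(r"\((-\d+)\)\.?\s*$")


def decode_status(line):
    """Return (code, name, description) for a trailing "(-N)", or None."""
    m = CODE_RE.search(line)
    if m is None:
        return None
    code = -int(m.group(1))
    name, text = LINUX_ERRNO.get(code, ("unknown", "not in table"))
    return code, name, text


for sample in (
    "radeon 0000:07:00.0: failed VCE resume (-22).",
    "testing IB on GFX ring (-110).",
    "radeon 0000:07:00.0: scheduling IB failed (-12).",
):
    print(decode_status(sample))
```

Read this way, the VCE resume fails with an invalid-argument error, the IB test on the GFX ring times out, and the later VCE IB scheduling attempts fail with out-of-memory.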

Eventually it starts to do this:
[  +1.153721] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[  +0.000018] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-110).
[  +0.000008] radeon 0000:07:00.0: scheduling IB failed (-12).
[  +0.000011] [drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[  +0.000018] [drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[  +0.000010] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 6 (-12).
[  +0.000002] radeon 0000:07:00.0: scheduling IB failed (-12).
[  +0.000010] [drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[  +0.000009] [drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[  +0.000009] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 7 (-12).
[  +0.001058] radeon 0000:07:00.0: GPU fault detected: 147 0x00044802
[  +0.000003] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000FF000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04048002
[  +0.000001] VM fault (0x02, vmid 2) at page 1044480, read from TC (72)
[ +10.114896] radeon 0000:07:00.0: ring 0 stalled for more than 10116msec
[  +0.000004] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000001506 last fence id 0x000000000000151a on ring 0)
[  +0.000022] radeon 0000:07:00.0: ring 4 stalled for more than 10112msec
[  +0.000003] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000000fd3 last fence id 0x0000000000000fd7 on ring 4)
[  +0.000029] radeon 0000:07:00.0: ring 3 stalled for more than 10116msec
[  +0.000002] radeon 0000:07:00.0: GPU lockup (current fence id 0x00000000000018b7 last fence id 0x000000000000190c on ring 3)
[  +0.507937] radeon 0000:07:00.0: ring 3 stalled for more than 10624msec
[  +0.000001] radeon 0000:07:00.0: ring 0 stalled for more than 10624msec

The log then repeats the non-UTF-8 character "\00" a few hundred times before
it cuts off.
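
As a cross-check of the "VM fault" line in the radeon excerpt above, here is a short Python sketch that decodes the two VM_CONTEXT1 register values. The bit positions are my reading of the radeon Southern Islands fault decoding and are only checked against this one log line, and the 4 KiB page size used for the byte address is an assumption; treat it as a sketch, not a reference.

```python
def decode_vm_fault(addr_reg, status_reg, page_shift=12):
    """Split the fault registers into the fields printed on the "VM fault" line.

    page_shift=12 assumes 4 KiB GPU pages.
    """
    return {
        "page": addr_reg,                          # register holds a page number
        "byte_address": addr_reg << page_shift,    # assumed 4 KiB pages
        "protections": status_reg & 0xFF,          # bits 0-7
        "client_id": (status_reg >> 12) & 0xFF,    # bits 12-19
        "write": bool((status_reg >> 24) & 0x1),   # bit 24: 1 = write, 0 = read
        "vmid": (status_reg >> 25) & 0xF,          # bits 25-28
    }


fault = decode_vm_fault(0x000FF000, 0x04048002)
print(fault)
print(hex(fault["byte_address"]))
```

With the values from the log this gives page 1044480, protections 0x02, vmid 2, client 72 and a read access, which matches the kernel's own summary line; under the 4 KiB assumption the faulting GPU virtual address is 0xFF000000.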



This is the part of dmesg running amdgpu where it appears to go off the rails:

[Jun 6 11:16] INFO: task RenderThread 3:6190 blocked for more than 120 seconds.
[  +0.000006]       Tainted: G           OE   4.11.0-mytest #2
[  +0.000002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000002] RenderThread 3  D    0  6190   1612 0x00000000
[  +0.000003] Call Trace:
[  +0.000007]  __schedule+0x3c6/0x8c0
[  +0.000004]  schedule+0x36/0x80
[  +0.000043]  amd_sched_entity_push_job+0xc4/0x110 [amdgpu]
[  +0.000003]  ? wake_atomic_t_function+0x60/0x60
[  +0.000031]  amdgpu_job_submit+0x72/0x90 [amdgpu]
[  +0.000027]  amdgpu_vm_bo_split_mapping+0x51f/0x7c0 [amdgpu]
[  +0.000024]  ? amdgpu_vm_do_copy_ptes+0x90/0x90 [amdgpu]
[  +0.000024]  amdgpu_vm_clear_freed+0x70/0xb0 [amdgpu]
[  +0.000024]  amdgpu_gem_va_ioctl+0x39a/0x3f0 [amdgpu]
[  +0.000014]  drm_ioctl+0x218/0x4b0 [drm]
[  +0.000010]  ? drm_ioctl+0x218/0x4b0 [drm]
[  +0.000023]  ? amdgpu_gem_metadata_ioctl+0x1d0/0x1d0 [amdgpu]
[  +0.000003]  ? kmem_cache_free+0x1b6/0x1e0
[  +0.000020]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  +0.000003]  do_vfs_ioctl+0xa3/0x600
[  +0.000002]  ? ____fput+0xe/0x10
[  +0.000003]  ? task_work_run+0x85/0xa0
[  +0.000002]  SyS_ioctl+0x79/0x90
[  +0.000002]  entry_SYSCALL_64_fastpath+0x1e/0xad
[  +0.000002] RIP: 0033:0x7fb15e3cb987
[  +0.000001] RSP: 002b:00007faefab6a778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000002] RAX: ffffffffffffffda RBX: 00007faf340060e0 RCX: 00007fb15e3cb987
[  +0.000001] RDX: 00007faefab6a7c0 RSI: 00000000c0286448 RDI: 0000000000000009
[  +0.000001] RBP: 00007faefab6a810 R08: 000000011c430000 R09: 000000000000000e
[  +0.000001] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000040086409
[  +0.000001] R13: 0000000000000009 R14: 00007faf34ef5700 R15: 00007faf34d61780
[Jun 6 11:17] wlp4s0: disconnect from AP 80:2a:a8:11:23:5e for new auth to 80:2a:a8:11:24:29
[  +0.011326] wlp4s0: authenticate with 80:2a:a8:11:24:29
[  +0.020720] wlp4s0: send auth to 80:2a:a8:11:24:29 (try 1/3)
[  +0.002594] wlp4s0: authenticated
[  +0.002803] wlp4s0: associate with 80:2a:a8:11:24:29 (try 1/3)
[  +0.005088] wlp4s0: RX AssocResp from 80:2a:a8:11:24:29 (capab=0x431 status=0 aid=2)
[  +0.000223] wlp4s0: associated
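
In the amdgpu trace above, ORIG_RAX is 0x10 (syscall 16, ioctl, on x86-64) and RSI holds the ioctl request code 0xc0286448. Here is a Python sketch that splits that code using the standard Linux ioctl bit layout for x86. The mapping of driver command 0x08 to DRM_AMDGPU_GEM_VA is from my recollection of amdgpu_drm.h and should be treated as an assumption, though it agrees with amdgpu_gem_va_ioctl appearing in the call trace.

```python
DRM_COMMAND_BASE = 0x40  # first driver-private DRM ioctl number


def decode_ioctl(request):
    """Split a Linux ioctl request code into its four fields (x86 layout)."""
    return {
        "dir": (request >> 30) & 0x3,      # 1 = write, 2 = read, 3 = read/write
        "size": (request >> 16) & 0x3FFF,  # argument struct size in bytes
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }


fields = decode_ioctl(0xC0286448)
print(fields)
# Driver-private command index relative to DRM_COMMAND_BASE.
print(hex(fields["nr"] - DRM_COMMAND_BASE))
```

This yields type 'd' (the DRM ioctl family), a read/write direction, a 40-byte argument and driver command 0x08, so the blocked RenderThread was in a GEM VA map/unmap call when it hung.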



End of Xorg.0.log (from an earlier run):

[    12.056] (II) systemd-logind: got pause for 13:79
[    32.415] (II) config/udev: removing GPU device /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/drm/card0 /dev/dri/card0
[    32.415] (II) config/udev: Adding drm device (/dev/dri/card0)
[    32.415] (II) xfree86: Adding drm device (/dev/dri/card0)
[    35.984] (II) systemd-logind: got resume for 13:81
[    35.984] (EE) FBDEV(0): FBIOPUT_VSCREENINFO: No such device
[    35.984] (EE)
Fatal server error:
[    35.984] (EE) EnterVT failed for screen 0
[    35.984] (EE)
[    35.984] (EE)
Please consult the The X.Org Foundation support
     at http://wiki.x.org
 for help.
[    35.984] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    35.984] (EE)
[    35.984] (EE) FBDEV(0): FBIOPUT_VSCREENINFO: No such device
[    36.074] (EE) Server terminated with error (1). Closing log file.
