[Mesa-dev] [Bug 108900] Non-recoverable GPU hangs with GfxBench v5 Aztec Ruins Vulkan test

Tue Feb 5 17:25:58 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=108900

--- Comment #6 from Eero Tamminen <eero.t.tamminen at intel.com> ---
Ubuntu 18.04 got LLVM 7, so I checked this again with the latest Mesa
(4f0a3c9f9eda65), and the hangs are still happening:
-----------------------------------------------
[61150.480135] Iteration 1/1: testfw_app --gfx vulkan --gl_api vulkan --width
1920 --height 1080 --fullscreen 1 --test_id vulkan_5_normal
[61214.327444] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0880c for
process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327446] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0x001001F4
[61214.327447] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x0A08800C
[61214.327449] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at
page 1049076, read from 'TC4' (0x54433400) (136)
[61214.327456] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fa0840c for
process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327457] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0x001001FE
[61214.327458] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x0A10800C
[61214.327459] amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32772) at
page 1049086, read from 'TC2' (0x54433200) (264)
[61214.327534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0fb0840c for
process testfw_app pid 26882 thread testfw_app pid 26883
[61214.327535] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0x00000000
[61214.327535] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x0A18802C
[61214.327537] amdgpu 0000:01:00.0: VM fault (0x2c, vmid 5, pasid 32772) at
page 0, read from 'TC0' (0x54433000) (392)
[61224.391478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
signaled seq=11267, emitted seq=11269
[61224.391506] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process testfw_app pid 26882 thread testfw_app pid 26883
[61224.391509] amdgpu 0000:01:00.0: GPU reset begin!
[61233.409935] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501
kthread_park+0x76/0x90
[61233.409937] Modules linked in: fuse snd_hda_codec_realtek
snd_hda_codec_generic amdgpu i915 x86_pkg_temp_thermal coretemp
snd_hda_codec_hdmi snd_hda_intel snd_hda_codec e1000e crct10dif_pclmul
snd_hwdep snd_hda_core crc32_pclmul chash gpu_sched snd_pcm ttm igb mei_me mei
pinctrl_sunrisepoint pinctrl_intel
[61233.409954] CPU: 6 PID: 26883 Comm: testfw_app Not tainted
5.0.0-rc5-CI-Nightly_1639+ #1
[61233.409956] Hardware name: Intel Corporation NUC8i7HVK/NUC8i7HVB, BIOS
HNKBLi70.86A.0053.2018.1217.1739 12/17/2018
[61233.409959] RIP: 0010:kthread_park+0x76/0x90
[61233.409961] Code: 00 48 89 df e8 eb b6 00 00 48 85 c0 74 25 31 c0 5b 5d c3
0f 0b a8 04 48 8b af f0 04 00 00 74 b0 0f 0b b8 da ff ff ff 5b 5d c3 <0f> 0b b8
f0 ff ff ff eb dd 0f 0b eb d9 0f 1f 00 66 2e 0f 1f 84 00
[61233.409963] RSP: 0018:ffffc900016f7b80 EFLAGS: 00010202
[61233.409965] RAX: 0000000000000004 RBX: ffff888466df68c0 RCX:
0000000000000000
[61233.409966] RDX: ffff888459f46ef0 RSI: ffff888466df68c0 RDI:
ffff88846b7e0dc0
[61233.409968] RBP: ffff88845f8694e0 R08: 000037b103a4e400 R09:
0000000000000139
[61233.409969] R10: ffffc900000f7dd0 R11: 00000000000029ec R12:
ffff888450031c00
[61233.409970] R13: ffff888459f427c0 R14: 0000000000000202 R15:
ffff888450031cc8
[61233.409972] FS:  00007f7c03838700(0000) GS:ffff88846eb80000(0000)
knlGS:0000000000000000
[61233.409974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61233.409975] CR2: 00007f895193cca0 CR3: 000000000320e005 CR4:
00000000003606e0
[61233.409976] Call Trace:
[61233.409985]  drm_sched_entity_fini+0x35/0x180 [gpu_sched]
[61233.410050]  amdgpu_vm_fini+0xba/0x520 [amdgpu]
[61233.410056]  ? idr_destroy+0x7a/0xc0
[61233.410098]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
[61233.410104]  drm_file_free.part.0+0x21b/0x300
[61233.410108]  drm_release+0x9d/0x110
[61233.410113]  __fput+0xa2/0x1d0
[61233.410117]  task_work_run+0x84/0xa0
[61233.410121]  do_exit+0x308/0xba0
[61233.410126]  do_group_exit+0x33/0xa0
[61233.410128]  get_signal+0x203/0x5f0
[61233.410133]  do_signal+0x30/0x6c0
[61233.410137]  ? do_vfs_ioctl+0xa4/0x630
[61233.410140]  ? __do_munmap+0x308/0x400
[61233.410144]  exit_to_usermode_loop+0x96/0xb0
[61233.410147]  do_syscall_64+0xe7/0x100
[61233.410150]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[61233.410153] RIP: 0033:0x7f7c05c3a5d7
[61233.410159] Code: Bad RIP value.
[61233.410160] RSP: 002b:00007f7c03836da8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[61233.410162] RAX: fffffffffffffe00 RBX: 00007f7c03836e34 RCX:
00007f7c05c3a5d7
[61233.410164] RDX: 00007f7c03836de0 RSI: 00000000c0206449 RDI:
0000000000000006
[61233.410165] RBP: 00007f7c03836de0 R08: 0000000000000000 R09:
0000000000000000
[61233.410167] R10: 00000000005544cc R11: 0000000000000246 R12:
00000000c0206449
[61233.410168] R13: 0000000000000006 R14: 00007f7c03836e60 R15:
00007f7bfc424ad8
[61233.410174] WARNING: CPU: 6 PID: 26883 at kernel/kthread.c:501
kthread_park+0x76/0x90
[61233.410175] ---[ end trace 1a47caa3a1dbe0cf ]---
-----------------------------------------------

PS. Outputting the test commands to dmesg before they start is IMHO a nice way
to find out which of the automated tests trigger hangs.
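
A minimal Python sketch of that idea, assuming the test harness runs with
enough privileges to write to /dev/kmsg (text written there shows up in dmesg).
The helper names format_marker, log_to_kmsg and run_logged are hypothetical and
not part of any test framework mentioned here; the "Iteration N/M:" prefix only
mirrors the marker line visible in the log above.

```python
import shlex
import subprocess


def format_marker(iteration, total, argv):
    """Build a one-line marker such as 'Iteration 1/1: testfw_app --gfx vulkan'."""
    return f"Iteration {iteration}/{total}: {shlex.join(argv)}"


def log_to_kmsg(message, kmsg_path="/dev/kmsg"):
    """Write one line to the kernel log buffer.

    Assumption: the caller may write to kmsg_path (normally root only).
    The path is a parameter so the helper can be tried against a plain file.
    """
    with open(kmsg_path, "w") as kmsg:
        kmsg.write(message + "\n")


def run_logged(argv, iteration=1, total=1, kmsg_path="/dev/kmsg"):
    """Log the command line to dmesg, then run the test and return its exit code."""
    log_to_kmsg(format_marker(iteration, total, argv), kmsg_path)
    return subprocess.run(argv).returncode
```

With a marker like this in place, the last "Iteration" line before a GPU fault
in dmesg identifies the test that was running when the hang occurred.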
