<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Unprivileged user mode program can cause GPU reset"
href="https://bugs.freedesktop.org/show_bug.cgi?id=109978">109978</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Unprivileged user mode program can cause GPU reset
</td>
</tr>
<tr>
<th>Product</th>
<td>DRI
</td>
</tr>
<tr>
<th>Version</th>
<td>XOrg git
</td>
</tr>
<tr>
<th>Hardware</th>
<td>x86-64 (AMD64)
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux (All)
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>major
</td>
</tr>
<tr>
<th>Priority</th>
<td>medium
</td>
</tr>
<tr>
<th>Component</th>
<td>DRM/amdkfd
</td>
</tr>
<tr>
<th>Assignee</th>
<td>dri-devel@lists.freedesktop.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sudolskym@gmail.com
</td>
</tr></table>
<p>
<div>
<pre><a href="https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72">https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72</a>
Sample program that triggers the problem (requires ROCm):
<span class="quote">> #include <hc.hpp>
> int main()
> {
> parallel_for_each(hc::extent<1>(1), [=]() [[hc]]
> {
> asm("s_trap 2");
> });
> return 0;
> }</span >
<span class="quote">> hcc -hc main.cpp
> ./a.out</span >
The process never exits, and pressing Ctrl-C causes a GPU reset, which breaks
every other process that is using ROCm on that GPU. The trap handler seems to
expect the queue handle in s[0:1], which is set when __builtin_trap() is used;
without it, the trap handler raises further exceptions.
System logs:
[ 247.428727] qcm fence wait loop timeout expired
[ 247.428730] The cp might be in an unrecoverable state due to an unsuccessful
queues preemption
[ 247.428736] amdgpu 0000:0b:00.0: GPU reset begin!
[ 247.619440] amdgpu 0000:0b:00.0: GPU reset
[ 248.152762] [drm] psp mode1 reset succeed
[ 248.279461] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[ 248.279584] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[ 248.279639] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[ 248.279769] [drm] PSP is resuming...
[ 248.428305] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[ 248.472774] WARNING: CPU: 23 PID: 21634 at
/build/linux-uQJ2um/linux-4.15.0/kernel/kthread.c:498 kthread_park+0x67/0x80
[ 248.472775] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs
msr nls_utf8 cifs ccm fscache cmac bnep binfmt_misc nls_iso8859_1 edac_mce_amd
arc4 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm
snd_seq_midi irqbypass snd_hda_intel snd_seq_midi_event snd_hda_codec btusb
snd_hda_core btrtl wmi_bmof snd_rawmidi iwlmvm snd_hwdep btbcm btintel snd_pcm
snd_seq bluetooth mac80211 snd_seq_device ecdh_generic snd_timer iwlwifi ccp
snd cfg80211 soundcore k10temp shpchp mac_hid sch_fq_codel ib_iser rdma_cm
iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs
zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq libcrc32c raid1
[ 248.472823] multipath linear raid0 amdgpu(OE) amdchash(OE) amdttm(OE)
amd_sched(OE) mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 amdkcl(OE) crypto_simd glue_helper amd_iommu_v2 cryptd
drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops drm dca nvme
i2c_algo_bit i2c_piix4 nvme_core ptp ahci atlantic libahci pps_core gpio_amdpt
wmi gpio_generic
[ 248.472846] CPU: 23 PID: 21634 Comm: a.out Tainted: G OE
4.15.0-45-generic #48-Ubuntu
[ 248.472847] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[ 248.472849] RIP: 0010:kthread_park+0x67/0x80
[ 248.472850] RSP: 0018:ffffb44fc7e27ad0 EFLAGS: 00010202
[ 248.472852] RAX: 0000000000000004 RBX: ffff9ec63f49e480 RCX:
0000000000000000
[ 248.472853] RDX: ffff9ec63c717198 RSI: ffff9ec63ea0c0c0 RDI:
ffff9ec63dd38000
[ 248.472854] RBP: ffffb44fc7e27ae0 R08: 0000000000000051 R09:
0000000000000000
[ 248.472855] R10: 0000000000000000 R11: 0000000000000056 R12:
ffff9ec63ea0c0c0
[ 248.472855] R13: ffff9ec64f4f4200 R14: ffff9ec63c710000 R15:
0000000000000000
[ 248.472857] FS: 00007fd52a286c00(0000) GS:ffff9ec65cdc0000(0000)
knlGS:0000000000000000
[ 248.472858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 248.472859] CR2: 00007f0c07687a98 CR3: 000000081b5b6000 CR4:
00000000003406e0
[ 248.472860] Call Trace:
[ 248.472865] amddrm_sched_entity_fini+0x44/0x1b0 [amd_sched]
[ 248.472868] amddrm_sched_entity_destroy+0x1f/0x30 [amd_sched]
[ 248.472907] amdgpu_vm_fini+0xbb/0x4f0 [amdgpu]
[ 248.472942] amdgpu_driver_postclose_kms+0x15b/0x2b0 [amdgpu]
[ 248.472952] drm_release+0x26b/0x390 [drm]
[ 248.472955] __fput+0xea/0x220
[ 248.472957] ____fput+0xe/0x10
[ 248.472959] task_work_run+0x9d/0xc0
[ 248.472961] do_exit+0x2ec/0xb40
[ 248.472963] do_group_exit+0x43/0xb0
[ 248.472965] get_signal+0x27b/0x590
[ 248.472968] do_signal+0x37/0x730
[ 248.472971] ? __switch_to_asm+0x34/0x70
[ 248.472973] ? __switch_to_asm+0x40/0x70
[ 248.472976] ? do_vfs_ioctl+0xa8/0x630
[ 248.472978] ? __schedule+0x299/0x8a0
[ 248.472980] exit_to_usermode_loop+0x73/0xd0
[ 248.472982] do_syscall_64+0x115/0x130
[ 248.472984] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 248.472986] RIP: 0033:0x7fd528bdd5d7
[ 248.472987] RSP: 002b:00007ffe830d4778 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 248.472988] RAX: fffffffffffffffc RBX: 0000000000000001 RCX:
00007fd528bdd5d7
[ 248.472989] RDX: 00007ffe830d47d0 RSI: 00000000c0184b0c RDI:
0000000000000003
[ 248.472990] RBP: 00007ffe830d47d0 R08: 00007ffe830d4890 R09:
0000000000000001
[ 248.472990] R10: 0000000000c92010 R11: 0000000000000246 R12:
00000000c0184b0c
[ 248.472991] R13: 0000000000000003 R14: 0000000000000000 R15:
00000000fffffffe
[ 248.472992] Code: 0e e8 6e c0 00 00 48 8d 7b 18 e8 35 d2 8e 00 44 89 e0 5b
41 5c 5d c3 0f 0b 41 bc da ff ff ff 44 89 e0 5b 41 5c 5d c3 0f 0b eb af <0f> 0b
41 bc f0 ff ff ff eb da 0f 1f 44 00 00 66 2e 0f 1f 84 00
[ 248.473020] ---[ end trace 19649ddd4a6314f7 ]---
[ 248.648453] [drm] UVD and UVD ENC initialized successfully.
[ 248.748509] [drm] VCE initialized successfully.
[ 248.749616] [drm] recover vram bo from shadow start
[ 248.749666] [drm] recover vram bo from shadow done
[ 248.749680] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!</pre>
</div>
</p>
</body>
</html>