[Bug 110099] Unprivileged user mode program can cause GPU reset

bugzilla-daemon at freedesktop.org
Thu Mar 14 08:19:39 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110099

            Bug ID: 110099
           Summary: Unprivileged user mode program can cause GPU reset
           Product: Spam
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: major
          Priority: medium
         Component: Two
          Assignee: daniel at fooishbar.org
          Reporter: baigshakira123 at gmail.com
                CC: dri-devel at lists.freedesktop.org, sudolskym at gmail.com
        Depends on: 109978

Created attachment 143663
  --> https://bugs.freedesktop.org/attachment.cgi?id=143663&action=edit
clone1

+++ This bug was initially created as a clone of Bug #109978 +++

https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72

Sample program which causes this (needs ROCm):

> #include <hc.hpp>
> int main()
> {
> 	parallel_for_each(hc::extent<1>(1), [=]() [[hc]]
> 	{
> 		asm("s_trap 2");
> 	});
> 	return 0;
> }

> hcc -hc main.cpp
> ./a.out

The process never exits, and pressing CTRL-C triggers a GPU reset that breaks
all other processes using ROCm on that GPU. It seems the trap handler expects
the queue handle in s[0:1], which is set when __builtin_trap() is used. When
s_trap is issued directly, s[0:1] is not set, so the trap handler raises
further exceptions.

System logs:

[  247.428727] qcm fence wait loop timeout expired
[  247.428730] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  247.428736] amdgpu 0000:0b:00.0: GPU reset begin!
[  247.619440] amdgpu 0000:0b:00.0: GPU reset
[  248.152762] [drm] psp mode1 reset succeed 
[  248.279461] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  248.279584] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[  248.279639] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  248.279769] [drm] PSP is resuming...
[  248.428305] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[  248.472774] WARNING: CPU: 23 PID: 21634 at /build/linux-uQJ2um/linux-4.15.0/kernel/kthread.c:498 kthread_park+0x67/0x80
[  248.472775] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs msr nls_utf8 cifs ccm fscache cmac bnep binfmt_misc nls_iso8859_1 edac_mce_amd arc4 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm snd_seq_midi irqbypass snd_hda_intel snd_seq_midi_event snd_hda_codec btusb snd_hda_core btrtl wmi_bmof snd_rawmidi iwlmvm snd_hwdep btbcm btintel snd_pcm snd_seq bluetooth mac80211 snd_seq_device ecdh_generic snd_timer iwlwifi ccp snd cfg80211 soundcore k10temp shpchp mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
[  248.472823]  multipath linear raid0 amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 amdkcl(OE) crypto_simd glue_helper amd_iommu_v2 cryptd drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops drm dca nvme i2c_algo_bit i2c_piix4 nvme_core ptp ahci atlantic libahci pps_core gpio_amdpt wmi gpio_generic
[  248.472846] CPU: 23 PID: 21634 Comm: a.out Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[  248.472847] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[  248.472849] RIP: 0010:kthread_park+0x67/0x80
[  248.472850] RSP: 0018:ffffb44fc7e27ad0 EFLAGS: 00010202
[  248.472852] RAX: 0000000000000004 RBX: ffff9ec63f49e480 RCX: 0000000000000000
[  248.472853] RDX: ffff9ec63c717198 RSI: ffff9ec63ea0c0c0 RDI: ffff9ec63dd38000
[  248.472854] RBP: ffffb44fc7e27ae0 R08: 0000000000000051 R09: 0000000000000000
[  248.472855] R10: 0000000000000000 R11: 0000000000000056 R12: ffff9ec63ea0c0c0
[  248.472855] R13: ffff9ec64f4f4200 R14: ffff9ec63c710000 R15: 0000000000000000
[  248.472857] FS:  00007fd52a286c00(0000) GS:ffff9ec65cdc0000(0000) knlGS:0000000000000000
[  248.472858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  248.472859] CR2: 00007f0c07687a98 CR3: 000000081b5b6000 CR4: 00000000003406e0
[  248.472860] Call Trace:
[  248.472865]  amddrm_sched_entity_fini+0x44/0x1b0 [amd_sched]
[  248.472868]  amddrm_sched_entity_destroy+0x1f/0x30 [amd_sched]
[  248.472907]  amdgpu_vm_fini+0xbb/0x4f0 [amdgpu]
[  248.472942]  amdgpu_driver_postclose_kms+0x15b/0x2b0 [amdgpu]
[  248.472952]  drm_release+0x26b/0x390 [drm]
[  248.472955]  __fput+0xea/0x220
[  248.472957]  ____fput+0xe/0x10
[  248.472959]  task_work_run+0x9d/0xc0
[  248.472961]  do_exit+0x2ec/0xb40
[  248.472963]  do_group_exit+0x43/0xb0
[  248.472965]  get_signal+0x27b/0x590
[  248.472968]  do_signal+0x37/0x730
[  248.472971]  ? __switch_to_asm+0x34/0x70
[  248.472973]  ? __switch_to_asm+0x40/0x70
[  248.472976]  ? do_vfs_ioctl+0xa8/0x630
[  248.472978]  ? __schedule+0x299/0x8a0
[  248.472980]  exit_to_usermode_loop+0x73/0xd0
[  248.472982]  do_syscall_64+0x115/0x130
[  248.472984]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  248.472986] RIP: 0033:0x7fd528bdd5d7
[  248.472987] RSP: 002b:00007ffe830d4778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  248.472988] RAX: fffffffffffffffc RBX: 0000000000000001 RCX: 00007fd528bdd5d7
[  248.472989] RDX: 00007ffe830d47d0 RSI: 00000000c0184b0c RDI: 0000000000000003
[  248.472990] RBP: 00007ffe830d47d0 R08: 00007ffe830d4890 R09: 0000000000000001
[  248.472990] R10: 0000000000c92010 R11: 0000000000000246 R12: 00000000c0184b0c
[  248.472991] R13: 0000000000000003 R14: 0000000000000000 R15: 00000000fffffffe
[  248.472992] Code: 0e e8 6e c0 00 00 48 8d 7b 18 e8 35 d2 8e 00 44 89 e0 5b 41 5c 5d c3 0f 0b 41 bc da ff ff ff 44 89 e0 5b 41 5c 5d c3 0f 0b eb af <0f> 0b 41 bc f0 ff ff ff eb da 0f 1f 44 00 00 66 2e 0f 1f 84 00
[  248.473020] ---[ end trace 19649ddd4a6314f7 ]---
[  248.648453] [drm] UVD and UVD ENC initialized successfully.
[  248.748509] [drm] VCE initialized successfully.
[  248.749616] [drm] recover vram bo from shadow start
[  248.749666] [drm] recover vram bo from shadow done
[  248.749680] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!


Referenced Bugs:

https://bugs.freedesktop.org/show_bug.cgi?id=109978
[Bug 109978] Unprivileged user mode program can cause GPU reset