[Bug 91880] Radeonsi on Grenada cards (r9 390) exceptionally unstable and poorly performing

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Mar 18 20:31:41 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=91880

--- Comment #186 from Chris Heald <cheald at gmail.com> ---
I've been doing a lot of experimentation, and I've found a few more things that
I feel are probably related:

* I can force a system hard-lock by doing anything that disables a monitor.
Notably, going full-screen under KDE/Xorg does this, but I can trigger it just
as easily by disabling a monitor with xrandr. Fullscreen under GNOME doesn't
seem to trigger the issue, which I suspect is because GNOME uses mutter for
screen management.

* Occasionally, the system boots up and gets stuck with a 150MHz memory clock,
rather than clocking up to the 1500MHz state. This causes display corruption
even if the sclk is set to 500MHz+. Setting the mclk mask manually fixes the
display corruption.

* I've been experimenting with different kernels ranging from 4.4 to 4.16rc5.
Earlier kernels feel more susceptible to hard-locking, though the later kernels
aren't immune to it.

* I tried a fresh Ubuntu 16.04 LTS install, and while it did NOT exhibit the
artifacting behavior, the system hard-locked within a few minutes of light
desktop usage.
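
For anyone who wants to reproduce the manual mclk workaround mentioned above,
here is a minimal sketch. It assumes the card is bound to the amdgpu driver and
exposes the power_dpm_force_performance_level and pp_dpm_mclk sysfs files; the
function name force_mclk_level and the level index passed to it are
illustrative, not taken from this report. The accepted input format for
pp_dpm_mclk has varied between kernel versions, so treat the single-index write
as an assumption to verify on your kernel.

```python
from pathlib import Path


def force_mclk_level(device_dir, level):
    """Pin the memory clock to one DPM level via amdgpu sysfs files.

    device_dir is assumed to look like /sys/class/drm/card0/device.
    level is the index shown in the pp_dpm_mclk listing (hypothetical here).
    """
    device = Path(device_dir)
    # Manual mode has to be selected before a level mask is honoured.
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    # Write the index of the single level that should stay enabled.
    (device / "pp_dpm_mclk").write_text(f"{level}\n")
    # Read the file back so the caller can check which level is now active.
    return (device / "pp_dpm_mclk").read_text()
```

Run as root, a call such as force_mclk_level("/sys/class/drm/card0/device", 1)
would select level 1. Which index maps to the 1500MHz state is card-specific,
so read pp_dpm_mclk first and pick the index from its listing.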

I've had a few classes of exceptions show up in kern.log:

On 4.4, my KDE/Wayland session hard-froze when moving a window, and produced a
log like this:

kernel: [  116.904013] radeon 0000:06:00.0: GPU fault detected: 146 0x0d8e040c
kernel: [  116.904017] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
  0x0001776C
kernel: [  116.904019] radeon 0000:06:00.0:  
VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E10400C
kernel: [  116.904021] VM fault (0x0c, vmid 7) at page 96108, read from 'TC3'
(0x54433300) (260)
kernel: [  127.306156] radeon 0000:06:00.0: ring 0 stalled for more than
10404msec
kernel: [  127.306164] radeon 0000:06:00.0: GPU lockup (current fence id
0x0000000000002419 last fence id 0x0000000000002431 on ring 0)
kernel: [  127.357942] radeon 0000:06:00.0: Saved 2200 dwords of commands on
ring 0.
kernel: [  127.357961] radeon 0000:06:00.0: GPU softreset: 0x00000009
kernel: [  127.357963] radeon 0000:06:00.0:   GRBM_STATUS=0xF5D01028
kernel: [  127.357965] radeon 0000:06:00.0:   GRBM_STATUS2=0x50000008
kernel: [  127.357968] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0xEC400002
kernel: [  127.357970] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0xEC400002
kernel: [  127.357972] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x08000002
kernel: [  127.357974] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0xEC000002
kernel: [  127.357976] radeon 0000:06:00.0:   SRBM_STATUS=0x20000040
kernel: [  127.357978] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.357980] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.357982] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.357984] radeon 0000:06:00.0:   CP_STAT = 0x84228600
kernel: [  127.357986] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000c00
kernel: [  127.357988] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x40000000
kernel: [  127.357991] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000400
kernel: [  127.357993] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000006
kernel: [  127.357995] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000003
kernel: [  127.357997] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x80000063
kernel: [  127.357999] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.358001] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.358003] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.358005] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
  0x00000000
kernel: [  127.358007] radeon 0000:06:00.0:  
VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
kernel: [  127.404670] radeon 0000:06:00.0: GRBM_SOFT_RESET=0x00010001
kernel: [  127.404725] radeon 0000:06:00.0: SRBM_SOFT_RESET=0x00000100
kernel: [  127.405874] radeon 0000:06:00.0:   GRBM_STATUS=0x00003028
kernel: [  127.405876] radeon 0000:06:00.0:   GRBM_STATUS2=0x00000008
kernel: [  127.405878] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0x00000006
kernel: [  127.405880] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0x00000006
kernel: [  127.405882] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x00000006
kernel: [  127.405884] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0x00000006
kernel: [  127.405885] radeon 0000:06:00.0:   SRBM_STATUS=0x20000A40
kernel: [  127.405887] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.405889] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.405891] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.405893] radeon 0000:06:00.0:   CP_STAT = 0x00000000
kernel: [  127.405895] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000000
kernel: [  127.405896] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x00000000
kernel: [  127.405898] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000000
kernel: [  127.405900] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000000
kernel: [  127.405902] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
kernel: [  127.405903] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x00000000
kernel: [  127.405905] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.405907] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.405909] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.405929] radeon 0000:06:00.0: GPU reset succeeded, trying to
resume
kernel: [  127.658172] [drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
kernel: [  127.658189] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm
resume failed
kernel: [  127.658194] [drm] probing gen 2 caps for device 1022:1453 = 733903/e
kernel: [  127.658197] [drm] PCIE gen 3 link speeds already enabled
kernel: [  127.664213] [drm] PCIE GART of 2048M enabled (table at
0x0000000000326000).
kernel: [  127.664341] radeon 0000:06:00.0: WB enabled
kernel: [  127.664344] radeon 0000:06:00.0: fence driver on ring 0 use gpu addr
0x0000000200000c00 and cpu addr 0xffff8807f3799c00
kernel: [  127.664346] radeon 0000:06:00.0: fence driver on ring 1 use gpu addr
0x0000000200000c04 and cpu addr 0xffff8807f3799c04
kernel: [  127.664347] radeon 0000:06:00.0: fence driver on ring 2 use gpu addr
0x0000000200000c08 and cpu addr 0xffff8807f3799c08
kernel: [  127.664349] radeon 0000:06:00.0: fence driver on ring 3 use gpu addr
0x0000000200000c0c and cpu addr 0xffff8807f3799c0c
kernel: [  127.664350] radeon 0000:06:00.0: fence driver on ring 4 use gpu addr
0x0000000200000c10 and cpu addr 0xffff8807f3799c10
kernel: [  127.664772] radeon 0000:06:00.0: fence driver on ring 5 use gpu addr
0x0000000000078b30 and cpu addr 0xffffc90003c38b30
kernel: [  127.664933] radeon 0000:06:00.0: fence driver on ring 6 use gpu addr
0x0000000200000c18 and cpu addr 0xffff8807f3799c18
kernel: [  127.664934] radeon 0000:06:00.0: fence driver on ring 7 use gpu addr
0x0000000200000c1c and cpu addr 0xffff8807f3799c1c
kernel: [  127.666482] [drm] ring test on 0 succeeded in 2 usecs
kernel: [  127.666568] [drm] ring test on 1 succeeded in 2 usecs
kernel: [  127.666586] [drm] ring test on 2 succeeded in 2 usecs
kernel: [  127.666735] [drm] ring test on 3 succeeded in 3 usecs
kernel: [  127.666745] [drm] ring test on 4 succeeded in 3 usecs
kernel: [  127.692636] [drm] ring test on 5 succeeded in 1 usecs
kernel: [  127.712543] [drm] UVD initialized successfully.
kernel: [  127.813896] [drm] ring test on 6 succeeded in 708 usecs
kernel: [  127.813920] [drm] ring test on 7 succeeded in 3 usecs
kernel: [  127.813921] [drm] VCE initialized successfully.
kernel: [  127.814029] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm
resume failed
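
A few of the hex values in the log above are easier to read once decoded. The
sketch below is only an aid for reading the log: the helper names are made up
for illustration, and the reading of the fence ids as "last emitted minus last
signalled" is my assumption about what the lockup message reports.

```python
def decode_client(raw):
    """Turn the 32-bit memory client tag into text (0x54433300 -> 'TC3')."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


def pending_fences(current_id, last_id):
    """Fences emitted but not yet signalled (assumed meaning of the two ids)."""
    return last_id - current_id


# Values copied from the 4.4 log above.
print(decode_client(0x54433300))       # TC3
print(0x0001776C)                      # 96108, the page in the VM fault line
print(pending_fences(0x2419, 0x2431))  # 24
```

So the fault address register and the "at page 96108" line describe the same
page, and the client tag is just the ASCII string 'TC3' padded with a zero
byte.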

On 4.15.10-041510-generic, I left my computer running overnight and came back
to it frozen with this in kern.log:

Mar 18 04:25:10 Gaia kernel: [  559.092721] BUG: stack guard page was hit at
000000001ecd1fa8 (stack is 0000000020941864..00000000cf703fbf)
Mar 18 04:25:10 Gaia kernel: [  559.092729] kernel stack overflow (page fault):
0000 [#1] SMP NOPTI
Mar 18 04:25:10 Gaia kernel: [  559.092733] Modules linked in:
nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay xfrm_user
xfrm4_tunnel tunnel4 l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel ipcomp
xfrm_ipcomp udp_tunnel esp4 pppox ah4 af_key xfrm_algo xt_CHECKSUM
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c
ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables devlink iptable_filter binfmt_misc
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel
edac_mce_amd snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib kvm_amd
snd_hwdep kvm uvcvideo snd_seq_midi irqbypass snd_seq_midi_event snd_rawmidi
crct10dif_pclmul videobuf2_vmalloc crc32_pclmul
Mar 18 04:25:10 Gaia kernel: [  559.092784]  videobuf2_memops videobuf2_v4l2
snd_seq ghash_clmulni_intel videobuf2_core snd_pcm pcbc videodev snd_seq_device
media snd_timer joydev aesni_intel aes_x86_64 snd crypto_simd input_leds
glue_helper serio_raw soundcore cryptd ccp k10temp shpchp mac_hid wmi_bmof
sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic
usbhid hid amdkfd amd_iommu_v2 amdgpu chash radeon i2c_algo_bit ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4
r8169 ahci mii libahci wmi gpio_amdpt gpio_generic
Mar 18 04:25:10 Gaia kernel: [  559.092832] CPU: 5 PID: 7352 Comm: tail
Tainted: G        W        4.15.10-041510-generic #201803152130
Mar 18 04:25:10 Gaia kernel: [  559.092834] Hardware name: Gigabyte Technology
Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F10 12/01/2017
Mar 18 04:25:10 Gaia kernel: [  559.092881] RIP:
0010:amdgpu_get_pp_num_states+0x88/0x120 [amdgpu]
Mar 18 04:25:10 Gaia kernel: [  559.092884] RSP: 0018:ffffb3cb8a837ca8 EFLAGS:
00010282
Mar 18 04:25:10 Gaia kernel: [  559.092888] RAX: 00000000000000d4 RBX:
ffffb3cb8a837cac RCX: 0000000000000001
Mar 18 04:25:10 Gaia kernel: [  559.092890] RDX: 0000000000000000 RSI:
ffffffffc087a88c RDI: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092893] RBP: ffffb3cb8a837d20 R08:
ffffffffc087a865 R09: ffff88c9ecebd98b
Mar 18 04:25:10 Gaia kernel: [  559.092895] R10: 0000000000000000 R11:
ffff88c9ecebd98a R12: ffff88c9ecebd000
Mar 18 04:25:10 Gaia kernel: [  559.092898] R13: ffffffffc087a858 R14:
00000000000000d4 R15: 0000000000000993
Mar 18 04:25:10 Gaia kernel: [  559.092901] FS:  00007fccb1787540(0000)
GS:ffff88c9fe740000(0000) knlGS:0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092904] CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Mar 18 04:25:10 Gaia kernel: [  559.092906] CR2: ffffb3cb8a838000 CR3:
00000004a30d0000 CR4: 00000000003406e0
Mar 18 04:25:10 Gaia kernel: [  559.092909] Call Trace:
Mar 18 04:25:10 Gaia kernel: [  559.092918]  ?
tty_insert_flip_string_fixed_flag+0x86/0xe0
Mar 18 04:25:10 Gaia kernel: [  559.092925]  dev_attr_show+0x23/0x60
Mar 18 04:25:10 Gaia kernel: [  559.092931]  sysfs_kf_seq_show+0xa3/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092935]  kernfs_seq_show+0x27/0x30
Mar 18 04:25:10 Gaia kernel: [  559.092939]  seq_read+0xe5/0x430
Mar 18 04:25:10 Gaia kernel: [  559.092943]  kernfs_fop_read+0x137/0x180
Mar 18 04:25:10 Gaia kernel: [  559.092948]  __vfs_read+0x3a/0x170
Mar 18 04:25:10 Gaia kernel: [  559.092954]  ?
security_file_permission+0xa1/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092958]  vfs_read+0x8e/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092962]  SyS_read+0x55/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092967]  do_syscall_64+0x73/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092973] 
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 18 04:25:10 Gaia kernel: [  559.092976] RIP: 0033:0x7fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092978] RSP: 002b:00007ffc17d84d68 EFLAGS:
00000246 ORIG_RAX: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092982] RAX: ffffffffffffffda RBX:
0000000000002000 RCX: 00007fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092984] RDX: 0000000000002000 RSI:
00007ffc17d84db0 RDI: 0000000000000003
Mar 18 04:25:10 Gaia kernel: [  559.092986] RBP: 0000000000000000 R08:
0000000000000000 R09: 00007fccb1313b40
Mar 18 04:25:10 Gaia kernel: [  559.092988] R10: 00000000fffffff3 R11:
0000000000000246 R12: 00007ffc17d84db0
Mar 18 04:25:10 Gaia kernel: [  559.092991] R13: 0000000000000003 R14:
ffffffffffffffff R15: 000055e8f3b747e0
Mar 18 04:25:10 Gaia kernel: [  559.092994] Code: c7 c2 7a a8 87 c0 be 00 10 00
00 4c 89 e7 e8 d0 08 90 d1 41 89 c7 8b 45 8c 85 c0 74 72 48 8d 5d 8c 45 31 f6
49 c7 c5 58 a8 87 c0 <42> 8b 44 b3 04 44 89 f1 4d 89 e8 83 f8 0a 74 2d 83 f8 02
49 c7
Mar 18 04:25:10 Gaia kernel: [  559.093080] RIP:
amdgpu_get_pp_num_states+0x88/0x120 [amdgpu] RSP: ffffb3cb8a837ca8
Mar 18 04:25:10 Gaia kernel: [  559.093084] ---[ end trace dbba232a9ca4c5c7
]---

Possibly related: if I `cat pp_num_states` from a terminal, I get a
segmentation fault:

root@Gaia:~# cat /sys/class/drm/card0/device/pp_num_states
Segmentation fault
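
To check the other sysfs attributes without losing the shell each time, the
read can be run in a child process and its fate reported. This is a rough
sketch; probe is a hypothetical helper name, and it only isolates the calling
process. It does nothing to protect the kernel from the oops itself.

```python
import signal
import subprocess


def probe(cmd, timeout=5.0):
    """Run cmd in a child process and describe how it ended."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        return "killed by " + signal.Signals(-result.returncode).name
    return f"exit {result.returncode}"
```

For example, probe(["cat", "/sys/class/drm/card0/device/pp_num_states"]) would
be expected to report "killed by SIGSEGV" on a system showing the behaviour
above, and "exit 0" where the attribute reads cleanly.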

I'm going to continue to dig. Let me know what logs/tests/whatnot I can provide
that would be useful.
