[Bug 109949] Hard lockup (Freeze)/GPU HANG: ecode 9:1:0xfffffffe with Skylake

Sun Mar 10 02:54:58 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=109949

--- Comment #2 from Paul <borrowck at protonmail.com> ---
Weird observation: right when the hang happens, my netconsole receiver becomes
unreachable over the network (ssh). It seems to be flooded by network packets
coming from the main computer, which is unresponsive. After disconnecting the
main computer from the switch, the receiver is reachable again.

I found this in the kernel log (netconsole receiver side):

[203009.876199] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
[203009.876248] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461
dev_watchdog+0x21a/0x220
[203009.876252] Modules linked in: veth ip6t_MASQUERADE ip6table_nat
nf_nat_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_CHECKSUM xt_comment
xt_tcpudp iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter bridge stp llc
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio
intel_rapl intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core
intel_pmc_ipc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915
nls_iso8859_1 nls_cp437 crct10dif_pclmul vfat crc32_pclmul fat
ghash_clmulni_intel kvmgt vfio_mdev mdev vfio_iommu_type1 vfio snd_soc_skl kvm
snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc snd_soc_sst_ipc
snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress
irqbypass ac97_bus i2c_algo_bit snd_pcm_dmaengine drm_kms_helper snd_hda_intel
snd_hda_codec snd_hda_core drm ppdev snd_hwdep aesni_intel snd_pcm snd_timer
aes_x86_64 crypto_simd snd intel_gtt cryptd glue_helper
[203009.876338]  intel_cstate intel_rapl_perf r8169 agpgart wdat_wdt mei_me
processor_thermal_device pcspkr tpm_crb soundcore realtek intel_soc_dts_iosf
syscopyarea libphy sysfillrect mei sysimgblt i2c_i801 fb_sys_fops evdev tpm_tis
parport_pc mac_hid parport tpm_tis_core int3400_thermal tpm acpi_thermal_rel
int3406_thermal pcc_cpufreq rng_core dptf_power int3403_thermal
int340x_thermal_zone ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
fscrypto sd_mod ahci libahci libata crc32c_intel xhci_pci xhci_hcd scsi_mod
[203009.876398] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.0.0-arch1-1-ARCH #1
[203009.876401] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./J4105B-ITX, BIOS P1.30 05/04/2018
[203009.876407] RIP: 0010:dev_watchdog+0x21a/0x220
[203009.876411] Code: 49 63 4c 24 e0 eb 8c 4c 89 ef c6 05 d2 5f bf 00 01 e8 0a
72 fc ff 89 d9 4c 89 ee 48 c7 c7 38 62 73 98 48 89 c2 e8 20 ec 96 ff <0f> 0b eb
be 66 90 0f 1f 44 00 00 48 c7 47 08 00 00 00 00 48 c7 07
[203009.876414] RSP: 0018:ffff9e2c37f03e78 EFLAGS: 00010286
[203009.876418] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
0000000000000000
[203009.876421] RDX: 0000000000000103 RSI: 00000000000000f6 RDI:
00000000ffffffff
[203009.876424] RBP: ffff9e2c3597245c R08: 0000000000000001 R09:
0000000000000366
[203009.876426] R10: 0000000000000004 R11: 0000000000000000 R12:
ffff9e2c35972480
[203009.876429] R13: ffff9e2c35972000 R14: 0000000000000001 R15:
ffff9e2c349ee680
[203009.876433] FS:  0000000000000000(0000) GS:ffff9e2c37f00000(0000)
knlGS:0000000000000000
[203009.876436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203009.876439] CR2: 00007f81eee7cd0f CR3: 000000000920e000 CR4:
0000000000340ee0
[203009.876442] Call Trace:
[203009.876449]  <IRQ>
[203009.876458]  ? qdisc_put_unlocked+0x30/0x30
[203009.876462]  ? qdisc_put_unlocked+0x30/0x30
[203009.876471]  call_timer_fn+0x2b/0x160
[203009.876477]  ? qdisc_put_unlocked+0x30/0x30
[203009.876482]  expire_timers+0x99/0x110
[203009.876489]  run_timer_softirq+0x8a/0x160
[203009.876497]  ? sched_clock+0x5/0x10
[203009.876503]  ? sched_clock_cpu+0xe/0xd0
[203009.876511]  __do_softirq+0x112/0x356
[203009.876520]  irq_exit+0xd9/0xf0
[203009.876526]  smp_apic_timer_interrupt+0x87/0x180
[203009.876531]  apic_timer_interrupt+0xf/0x20
[203009.876535]  </IRQ>
[203009.876543] RIP: 0010:cpuidle_enter_state+0xbc/0x480
[203009.876547] Code: e8 19 08 a3 ff 80 7c 24 13 00 74 17 9c 58 0f 1f 44 00 00
f6 c4 02 0f 85 99 03 00 00 31 ff e8 0b 2b a9 ff fb 66 0f 1f 44 00 00 <45> 85 e4
0f 88 c4 02 00 00 49 63 cc 4c 8b 3c 24 4c 2b 7c 24 08 48
[203009.876549] RSP: 0018:ffffb9dd40d03e98 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[203009.876553] RAX: ffff9e2c37f00000 RBX: ffffffff988b8180 RCX:
000000000000001f
[203009.876556] RDX: 0000b8a2eb8d9a10 RSI: ffffffff986aa7d8 RDI:
ffffffff986b2e68
[203009.876559] RBP: ffff9e2c37f2ad00 R08: 0000000000000000 R09:
0000000000021500
[203009.876561] R10: 000114870be2a1fa R11: ffff9e2c37f20be4 R12:
0000000000000007
[203009.876564] R13: ffffffff988b8438 R14: 0000000000000007 R15:
0000000000000000
[203009.876574]  ? cpuidle_enter_state+0x97/0x480
[203009.876582]  do_idle+0x217/0x250
[203009.876589]  cpu_startup_entry+0x19/0x20
[203009.876595]  start_secondary+0x1aa/0x200
[203009.876603]  secondary_startup_64+0xa4/0xb0
[203009.876611] ---[ end trace 5d229bf56124be76 ]---
[203009.971055] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203030.027038] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203050.080567] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203069.934074] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
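
The four r8169 messages above are roughly 20 seconds apart. A minimal Python
sketch for pulling such timestamps out of a saved log and computing the gaps,
assuming dmesg-style "[seconds.micros]" prefixes. The names event_times and
gaps are made up for illustration and are not part of any tool:

```python
import re

# Matches a dmesg-style line: "[seconds.micros] message".
# Wrapped continuation lines (no timestamp prefix) simply do not match.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def event_times(log_text, needle):
    """Return the timestamps (in seconds) of lines whose message contains needle."""
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and needle in m.group(2):
            times.append(float(m.group(1)))
    return times


def gaps(times):
    """Return the differences between consecutive timestamps, rounded to ms."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


# Excerpt copied from the receiver-side log in this comment.
SAMPLE = """\
[203009.876199] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
[203009.971055] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203030.027038] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203050.080567] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
[203069.934074] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop:
666, delay: 100).
"""

times = event_times(SAMPLE, "rtl_txcfg_empty_cond")
print("occurrences:", len(times))
print("gaps in seconds:", gaps(times))
```

Pointing the same functions at the full log file (read with open(...).read())
would show whether the pattern continues for as long as the main computer
stays connected to the switch.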

I don't really know what any of this means. How can a frozen system flood
another system with packets?
