[Intel-gfx] Memory corruption on hibernate/thaw with KMS

Wed Oct 26 05:44:46 CEST 2011

On Wed, 2011-10-12 at 14:09 +1100, Bojan Smojver wrote:
> Bug #41705.

To follow up on this: I cloned Linus' tree as of today (i.e. what is
currently staged for 3.2), pulled Keith's tree
(git://people.freedesktop.org/~keithp/linux drm-intel-next) on top of it
and compiled. I did 26 hibernate/thaw cycles and then went to check the
machine.

Unfortunately, I then got:
---------------------------
[  729.195407] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  729.199345] IP: [<ffffffff8125121d>] __list_add+0x14/0x7f
[  729.203288] PGD 0 
[  729.207051] Oops: 0000 [#1] SMP 
[  729.210874] CPU 0 
[  729.210901] Modules linked in: fuse ppdev parport_pc lp parport bnep bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm thinkpad_acpi snd_timer e1000e uvcvideo rfkill snd videodev media v4l2_compat_ioctl32 qcserial usb_wwan mxm_wmi wmi snd_page_alloc microcode i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr intel_ips soundcore joydev ipv6 firewire_ohci firewire_core crc_itu_t sdhci_pci sdhci mmc_core i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[  729.231392] 
[  729.235351] Pid: 897, comm: dbus-daemon Tainted: G        W   3.1.0+ #2 LENOVO 4313CTO/4313CTO
[  729.239316] RIP: 0010:[<ffffffff8125121d>]  [<ffffffff8125121d>] __list_add+0x14/0x7f
[  729.243143] RSP: 0018:ffff88022ce57d60  EFLAGS: 00010286
[  729.246941] RAX: ffff8801ab1454d0 RBX: 0000000000000000 RCX: 0000000000000054
[  729.250770] RDX: 0000000000000000 RSI: ffff880229777100 RDI: ffff8801ab145520
[  729.254604] RBP: ffff88022ce57d80 R08: ffff88020c7f28e8 R09: 00007f0aaeda4000
[  729.258294] R10: 0000000000015ff8 R11: 0000000000015fa8 R12: ffff880229777100
[  729.261820] R13: ffff8801ab145520 R14: ffff8802297770b0 R15: ffff8801ab1450b0
[  729.265401] FS:  00007f0aaed80800(0000) GS:ffff88023bc00000(0000) knlGS:0000000000000000
[  729.269032] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  729.272674] CR2: 0000000000000008 CR3: 000000022d275000 CR4: 00000000000006f0
[  729.276338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  729.279988] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  729.283616] Process dbus-daemon (pid: 897, threadinfo ffff88022ce56000, task ffff88022ca55c80)
[  729.287278] Stack:
[  729.290885]  ffff88020c7f28e8 ffff88022c192d80 ffff88022a0bd500 ffff8801ab1454d0
[  729.294593]  ffff88022ce57d90 ffffffff810f1fe5 ffff88022ce57e10 ffffffff81055b6b
[  729.298289]  0000000000000000 ffff8801ab1450e8 ffff8801ab1450f0 ffff8801ab1450c8
[  729.301966] Call Trace:
[  729.305664]  [<ffffffff810f1fe5>] vma_prio_tree_add+0x81/0x95
[  729.309405]  [<ffffffff81055b6b>] dup_mm+0x2f3/0x488
[  729.313143]  [<ffffffff810566db>] copy_process+0x9b1/0x119c
[  729.316888]  [<ffffffff811fdca6>] ? security_file_alloc+0x16/0x18
[  729.320631]  [<ffffffff81129db5>] ? get_empty_filp+0xa4/0x133
[  729.324351]  [<ffffffff81056ff0>] do_fork+0xef/0x22d
[  729.328029]  [<ffffffff813daf9e>] ? sock_alloc_file+0xb3/0x114
[  729.331672]  [<ffffffff810440eb>] ? should_resched+0xe/0x2d
[  729.335300]  [<ffffffff8149a9bd>] ? _cond_resched+0xe/0x22
[  729.338914]  [<ffffffff813d9388>] ? might_fault+0xe/0x10
[  729.342532]  [<ffffffff81016336>] sys_clone+0x28/0x2a
[  729.346128]  [<ffffffff814a2ae3>] stub_clone+0x13/0x20
[  729.349661]  [<ffffffff814a27c2>] ? system_call_fastpath+0x16/0x1b
[  729.353249] Code: ad de 48 b9 00 02 20 00 00 00 ad de 48 89 13 48 89 4b 08 5e 5b 5d c3 55 48 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 89 d3 41 50 <4c> 8b 42 08 49 39 f0 74 20 49 89 d1 48 89 f1 48 c7 c2 98 15 7e 
[  729.361220] RIP  [<ffffffff8125121d>] __list_add+0x14/0x7f
[  729.365218]  RSP <ffff88022ce57d60>
[  729.369206] CR2: 0000000000000008
[  729.439968] ---[ end trace a0f13f2533f6746a ]---
---------------------------
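
A side note on the numbers in the oops above: the fault address (CR2) is
0x8 and RDX is 0. If RDX holds the 'next' argument of __list_add (the
third argument in the x86-64 calling convention) and the kernel was built
with CONFIG_DEBUG_LIST, which is an assumption here, then the faulting
load would be the next->prev sanity check with next == NULL. The
user-space sketch below only illustrates the offset arithmetic; the helper
names are made up for the example and are not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the kernel's struct list_head: two pointers. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/*
 * Hypothetical helper: the address that a load of next->prev would touch
 * for a given 'next', computed arithmetically so that nothing is actually
 * dereferenced.
 */
static uintptr_t next_prev_address(uintptr_t next)
{
	return next + offsetof(struct list_head, prev);
}

/*
 * Hypothetical helper: the case from the oops, where 'next' is NULL
 * (RDX = 0). On a 64-bit build 'prev' sits one pointer (8 bytes) into
 * the struct.
 */
static uintptr_t null_next_fault_address(void)
{
	return next_prev_address(0);
}
```

On a 64-bit build null_next_fault_address() gives 8, the same value as
CR2. That would fit a NULL list pointer being handed to __list_add from
vma_prio_tree_add, rather than a wild one.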

After that the machine started misbehaving and threw a whole lot more
kernel errors, which I could no longer capture.

The only other error was unrelated. It looked like this:
---------------------------
[  272.029435] ------------[ cut here ]------------
[  272.029441] WARNING: at drivers/net/ethernet/intel/e1000e/ich8lan.c:870 e1000_acquire_swflag_ich8lan+0x4f/0x143 [e1000e]()
[  272.029443] Hardware name: 4313CTO
[  272.029445] e1000e: eth0: contention for Phy access
[  272.029446] Modules linked in: fuse ppdev parport_pc lp parport bnep bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm thinkpad_acpi snd_timer e1000e uvcvideo rfkill snd videodev media v4l2_compat_ioctl32 qcserial usb_wwan mxm_wmi wmi snd_page_alloc microcode i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr intel_ips soundcore joydev ipv6 firewire_ohci firewire_core crc_itu_t sdhci_pci sdhci mmc_core i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[  272.029480] Pid: 5712, comm: kworker/1:4 Tainted: G        W   3.1.0+ #2
[  272.029482] Call Trace:
[  272.029485]  [<ffffffff81057a36>] warn_slowpath_common+0x83/0x9b
[  272.029488]  [<ffffffff81057af1>] warn_slowpath_fmt+0x46/0x48
[  272.029492]  [<ffffffff81084f41>] ? smp_call_function_single+0x97/0xfd
[  272.029499]  [<ffffffffa0225767>] e1000_acquire_swflag_ich8lan+0x4f/0x143 [e1000e]
[  272.029508]  [<ffffffffa022e5d0>] __e1000_read_phy_reg_hv+0x4d/0x157 [e1000e]
[  272.029518]  [<ffffffffa022ee8a>] e1000_read_phy_reg_hv+0x13/0x15 [e1000e]
[  272.029527]  [<ffffffffa02327b3>] e1000_phy_read_status+0xf6/0x163 [e1000e]
[  272.029537]  [<ffffffffa0236df2>] e1000_watchdog_task+0x104/0x5d2 [e1000e]
[  272.029540]  [<ffffffff8149a93e>] ? __schedule+0x63b/0x669
[  272.029550]  [<ffffffffa0236cee>] ? e1000_update_mng_vlan+0x68/0x68 [e1000e]
[  272.029554]  [<ffffffff8106eab0>] process_one_work+0x176/0x2a9
[  272.029559]  [<ffffffff8106f5be>] worker_thread+0xda/0x15d
[  272.029562]  [<ffffffff8106f4e4>] ? manage_workers+0x176/0x176
[  272.029565]  [<ffffffff81072a0b>] kthread+0x84/0x8c
[  272.029568]  [<ffffffff814a4934>] kernel_thread_helper+0x4/0x10
[  272.029572]  [<ffffffff81072987>] ? kthread_worker_fn+0x148/0x148
[  272.029575]  [<ffffffff814a4930>] ? gs_change+0x13/0x13
[  272.029577] ---[ end trace a0f13f2533f67469 ]---
---------------------------

So, yeah, still there with the latest code.

PS. I will add this comment to the bug as well.

-- 
Bojan