[Bug 59321] New: S4 broken with Haswell

Wed Jun 5 02:27:19 PDT 2013

https://bugzilla.kernel.org/show_bug.cgi?id=59321

           Summary: S4 broken with Haswell
           Product: Drivers
           Version: 2.5
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - Intel)
        AssignedTo: intel-gfx-bugs at lists.freedesktop.org
        ReportedBy: tiwai at suse.de
                CC: intel-gfx-bugs at lists.freedesktop.org
        Regression: No

On Haswell laptops, the machine hangs after a number of S4 (hibernation)
cycles, typically within 20 cycles.  3.10-rc4 is the most unstable and usually
hangs within a couple of S4 cycles.

With luck, you get an Oops message like the one below before the machine slowly
dies.

 general protection fault: 0000 [#1] SMP 
 CPU: 3 PID: 3804 Comm: packagekitd Tainted: GF            3.10.0-rc4-test+ #1
 task: ffff880231ea8380 ti: ffff88022e138000 task.ti: ffff88q6
 RIP: 0010:[<ffffffff81166ed0>]  [<ffffffff81166ed0>] path_lookupat+0x120/0x830
 RSP: 0018:ffff88022e139cd8  EFLAGS: 00010246
 RAX: 00f9000000f80000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff88022e139d18 RSI: 0000000000000000 RDI: ffff88022ed4e740
 RBP: ffff88022e139d58 R08: ffff88022e139c3f R09: ffff8802358e303e
 R10: ffff88022ed4e778 R11: 0000000000000003 R12: ffff88022ec38da0
 R13: ffff88022e139da8 R14: 0000000000000000 R15: ffff88022e139d08
 FS:  00007f509d934700(0000) GS:ffff88023eac0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f509d924cd8 CR3: 0000000231418000 CR4: 00000000001407e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
  ffff88022e139cf8 0000000000000000 ffff88022e139cf8 ffffffff81162945
  ffff88022e139d98 0000000000000286 ffff8802360673a0 ffff88022ed4e740
  ffff88022ec38da0 0000000000000000 000000d02e139d68 ffff8802358e3000
 Call Trace:
  [<ffffffff81162945>] ? terminate_walk+0x35/0x40
  [<ffffffff81167613>] filename_lookup+0x33/0xd0
  [<ffffffff8116877b>] user_path_at_empty+0x7b/0xb0
  [<ffffffff8117681c>] ? mntput_no_expire+0x4c/0x1b0
  [<ffffffff8115d5c7>] ? cp_new_stat+0x137/0x150
  [<ffffffff811687bc>] user_path_at+0xc/0x10
  [<ffffffff8115d881>] vfs_fstatat+0x51/0xb0
  [<ffffffff8115d949>] vfs_lstat+0x19/0x20
  [<ffffffff8115d96f>] SyS_newlstat+0x1f/0x50
  [<ffffffff814701d2>] system_call_fastpath+0x16/0x1b
 Code: ff ff 83 f8 00 89 c3 0f 85 6e 06 00 00 4c 8b 65 c0 4d 85 e4 0f 84 47 06
00 00 41 f6 44 24 02 04 0f 85 a2 01 00 00 49 8b 44 24 20 <48> 83 78 08 00 0f 84
71 01 00 00 41 83 e6 01 90 0f 84 87 01 00 
 RIP  [<ffffffff81166ed0>] path_lookupat+0x120/0x830
  RSP <ffff88022e139cd8>

The Oops patterns vary quite a lot, but most of them are related to VFS path
lookup.  For example, another typical Oops looks like the one below (in this
case on a 3.0-based kernel with drm/i915 backports, but it is also seen on all
other kernels):
 BUG: soft lockup - CPU#0 stuck for 23s! [sh:11043]
 CPU 0 
 Pid: 11043, comm: sh Tainted: G      D    NX 3.0.65-0.6.6.1.5358.1.PTF-default 
 RIP: 0010:[<ffffffff81445c58>]  [<ffffffff81445c58>] _raw_spin_lock+0x18/0x20
 RSP: 0018:ffff8801b384fc40  EFLAGS: 00000297
 RAX: 000000000000f221 RBX: ffff8801b384fc78 RCX: 0000000000013568
 RDX: 000000000000f220 RSI: ffffc90000878760 RDI: ffffffff81a02700
 RBP: ffffc90000878760 R08: 0000000000000007 R09: 0000000000000025
 R10: 0000000000000007 R11: ffffffff811e17e0 R12: ffffffff8144e2ee
 R13: 0000000000000000 R14: 00000002000200da R15: 0000000000000000
 FS:  00007ff4bd5f1700(0000) GS:ffff8801bfa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ff4bcd88428 CR3: 00000001b3970000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process sh (pid: 11043, threadinfo ffff8801b384e000, task ffff8801543f64c0)
 Stack:
  ffffffff81168fa0 ffff88018d19e838 ffffffff8116aa75 0000000000000000
  ffff8801b331fbc0 ffff8801b331fbc0 ffff8801bec04d58 0000000000000000
  ffffffff811ad190 ffff8801bec04d58 ffff8801b331fbc0 ffff880190f71540
 Call Trace:
  [<ffffffff81168fa0>] inode_sb_list_add+0x10/0x50
  [<ffffffff8116aa75>] iget_locked+0x155/0x170
  [<ffffffff811ad190>] proc_get_inode+0x10/0x110
  [<ffffffff811b3dd9>] proc_lookup_de+0x69/0xe0
  [<ffffffff811adc20>] proc_root_lookup+0x20/0x60
  [<ffffffff8115b012>] d_alloc_and_lookup+0x42/0x80
  [<ffffffff8115c7c5>] do_lookup+0x2a5/0x3a0
  [<ffffffff8115d992>] do_last+0x102/0x800
  [<ffffffff8115ecf9>] path_openat+0xd9/0x420
  [<ffffffff8115f17c>] do_filp_open+0x4c/0xc0
  [<ffffffff8114fdc1>] do_sys_open+0x171/0x1f0
  [<ffffffff8144d912>] system_call_fastpath+0x16/0x1b
  [<00007ff4bcd18da0>] 0x7ff4bcd18d9f
 Code: 0f 95 c0 0f b6 c0 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 b8 00 00 01 00
00 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 <eb> f5 c3 0f 1f 44 00
00 9c 58 0f 1f 44 00 00 48 89 c6 fa 66 0f 
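
To compare Oops reports like the two above, it can help to pull the function
names out of each "Call Trace:" section and look at them side by side.  The
following is only a minimal sketch, not a tool used for this report: the
function name trace_symbols and the regular expression are assumptions made
for illustration, and the pattern only covers the "[<address>] symbol+off/len"
frame format shown in these logs.

```python
import re

# Matches call-trace entries such as
#   [<ffffffff81167613>] filename_lookup+0x33/0xd0
# including the "? " prefix that marks frames the unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(oops_text, include_unreliable=False):
    """Return the function names listed under "Call Trace:", in order."""
    symbols = []
    in_trace = False
    for line in oops_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Call Trace:"):
            in_trace = True
            continue
        if not in_trace:
            continue
        if stripped.startswith("Code:"):
            # The instruction dump follows the trace; stop here.
            break
        match = FRAME_RE.search(stripped)
        if match is None:
            # Frames without a symbol (e.g. a user-space address) are skipped.
            continue
        if match.group(1) and not include_unreliable:
            continue
        symbols.append(match.group(2))
    return symbols
```

Feeding each Oops text to trace_symbols() gives one list of names per report,
which makes it easier to see how many of them go through the path lookup code.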

The problem is found on all kernels up to 3.10-rc4.
It is also seen on different Haswell variants; at least Mobile GT2 and ULT
show the problem.

Some more data points:

- The S4 problem appears with both the user-space and the kernel hibernation
  methods.

- S4 cycles are more stable when no network is connected.
  The crash above is seen with the following test setup:
    * running SLED11 user-space with an updated X stack, and
    * starting netconsole over Ethernet (r8169 or e1000e drivers)

  Without the network connection, S4 once survived over 100 cycles.
  But that might just be luck.

- S4 becomes more stable if you disable loading the i915 module in the initrd.
  On SUSE kernels, the i915 module is loaded in the initrd, and the initrd
  triggers the resume from the S4 image either via the user-space suspend
  command or by writing to sysfs.
  When I exclude the i915 module by setting $NO_KMS_IN_INITRD in
  /etc/sysconfig/kernel and rerun mkinitrd, the problem is rarely seen.
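
For anyone trying to reproduce the cycle counts mentioned above, repeated S4
cycles with the kernel hibernation method could be driven roughly as follows.
This is a hedged sketch, not the test harness used for this report: the
function names, the 60-second wake delay, the settle time and the use of the
rtc0 wake alarm are all assumptions.  It relies only on the standard sysfs
files /sys/power/state and /sys/class/rtc/rtc0/wakealarm, and it needs root.

```python
import time


def hibernate_once(wake_after_s=60,
                   state_path="/sys/power/state",
                   alarm_path="/sys/class/rtc/rtc0/wakealarm"):
    """Arm an RTC wake alarm, then enter S4 via the kernel method."""
    # Clear any pending alarm before setting a new one.
    with open(alarm_path, "w") as alarm:
        alarm.write("0\n")
    # Assumption: the alarm is given as absolute seconds since the epoch.
    with open(alarm_path, "w") as alarm:
        alarm.write(f"{int(time.time()) + wake_after_s}\n")
    # Writing "disk" starts hibernation; the write returns after resume.
    with open(state_path, "w") as state:
        state.write("disk\n")


def run_cycles(count, settle_s=10, **kwargs):
    """Run `count` S4 cycles, pausing `settle_s` seconds after each resume."""
    for i in range(1, count + 1):
        print(f"S4 cycle {i}/{count}", flush=True)
        hibernate_once(**kwargs)
        time.sleep(settle_s)
    return count
```

Calling run_cycles(20) as root would attempt 20 cycles; if the machine hangs,
the last cycle number printed (for example over netconsole) shows how far it
got.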
