[Bug 106832] New: [i915]Kernel panic occurs after ECHO 15 to /sys/kernel/debug/dri/0/i915_wedged

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Wed Jun 6 05:42:17 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=106832

            Bug ID: 106832
           Summary: [i915]Kernel panic occurs after ECHO 15 to
                    /sys/kernel/debug/dri/0/i915_wedged
           Product: DRI
           Version: unspecified
          Hardware: Other
                OS: Linux (All)
            Status: NEW
          Severity: major
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: owen.zhang at intel.com
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

Created attachment 140042
  --> https://bugs.freedesktop.org/attachment.cgi?id=140042&action=edit
for reproducer

1. The console message:
[  489.724739] [drm] GPU HANG: ecode 9:0:0xfef77ffe, reason: Manually setting wedged to 15, action: reset
[  489.733997] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  489.739675] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[  489.745400] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  489.751182] i915 0000:00:02.0: Resetting chip after gpu hang
[  489.756901] [drm] RC6 on
[  497.695484] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  497.701381] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[  497.709173] IP: reset_common_ring+0x23/0x140 [i915]
[  497.714012] PGD 0 P4D 0
[  497.716526] Oops: 0000 [#1] SMP PTI
[  497.719988] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge
stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security
iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
snd_hda_codec_hdmi rfkill snd_hda_codec_realtek intel_rapl
snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel intel_powerclamp
snd_hda_codec coretemp kvm_intel snd_hda_core kvm snd_hwdep snd_seq
snd_seq_device snd_pcm snd_timer irqbypass mei_me snd crct10dif_pclmul dell_wmi
crc32_pclmul
[  497.790676]  ghash_clmulni_intel pcbc iTCO_wdt dell_smbios
iTCO_vendor_support mei aesni_intel crypto_simd sparse_keymap wmi_bmof dcdbas
glue_helper cryptd shpchp soundcore nfsd sg i2c_i801 pcspkr acpi_pad
auth_rpcgss wmi nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod i915
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
e1000e ahci libahci ptp libata i2c_core pps_core crc32c_intel serio_raw video
dm_mirror dm_region_hash dm_log dm_mod
[  497.832418] CPU: 4 PID: 308 Comm: kworker/4:2 Not tainted 4.14.20-mss-pv5+
#13
[  497.839583] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.1.1
10/07/2015
[  497.846932] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  497.852889] task: ffff9e931c3e4740 task.stack: ffffb9ec413d4000
[  497.858773] RIP: 0010:reset_common_ring+0x23/0x140 [i915]
[  497.864128] RSP: 0018:ffffb9ec413d7c50 EFLAGS: 00010246
[  497.869312] RAX: ffffffffc03e9290 RBX: ffff9e9323745200 RCX:
0000000000000006
[  497.876392] RDX: 0000000000000000 RSI: ffff9e9323745200 RDI:
0000000000000000
[  497.883469] RBP: ffff9e9246426000 R08: 0000000000000000 R09:
00000000000009f9
[  497.890547] R10: 0000000000000007 R11: 0000000000000000 R12:
ffff9e9323745200
[  497.897628] R13: 00000000ffffffff R14: ffff9e931a09cdc8 R15:
ffff9e9246426000
[  497.904708] FS:  0000000000000000(0000) GS:ffff9e9331d00000(0000)
knlGS:0000000000000000
[  497.912734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  497.918435] CR2: 0000000000000070 CR3: 00000001ee00a005 CR4:
00000000003606e0
[  497.925515] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  497.932594] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  497.939671] Call Trace:
[  497.942105]  i915_reset_engine+0x5e/0xf0 [i915]
[  497.946606]  i915_handle_error+0x21e/0x3e0 [i915]
[  497.951274]  ? vsnprintf+0x203/0x4d0
[  497.954820]  ? vscnprintf+0x9/0x20
[  497.958193]  ? scnprintf+0x49/0x70
[  497.961575]  hangcheck_declare_hang+0xd8/0x110 [i915]
[  497.966595]  ? fwtable_read32+0x83/0x190 [i915]
[  497.971095]  i915_hangcheck_elapsed+0x2cf/0x380 [i915]
[  497.976196]  process_one_work+0x141/0x340
[  497.980183]  worker_thread+0x47/0x3e0
[  497.983820]  kthread+0xfc/0x130
[  497.986943]  ? rescuer_thread+0x380/0x380
[  497.990928]  ? kthread_park+0x60/0x60
[  497.994566]  ? do_syscall_64+0x6f/0x1a0
[  497.998378]  ? SyS_exit_group+0x10/0x10
[  498.002190]  ret_from_fork+0x35/0x40
[  498.005739] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 48 85 f6 55
48 89 fd 53 48 89 f3 0f 84 db 00 00 00 48 8b bf 88 03 00 00 48 83 e7 fc <48> 8b
47 70 48 39 46 70 74 29 48 85 ff 74 0b f0 ff 0f 0f 88 93
[  498.024481] RIP: reset_common_ring+0x23/0x140 [i915] RSP: ffffb9ec413d7c50
[  498.031301] CR2: 0000000000000070
[  498.034592] ---[ end trace e7c5283a77cf3e17 ]---
[  498.039170] Kernel panic - not syncing: Fatal exception
[  498.044399] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
[  498.055011] ---[ end Kernel panic - not syncing: Fatal exception
[  498.060975] ------------[ cut here ]------------
[  498.065557] WARNING: CPU: 4 PID: 308 at kernel/sched/core.c:1178
set_task_cpu+0x184/0x190
[  498.073670] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge
stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security
iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
snd_hda_codec_hdmi rfkill snd_hda_codec_realtek intel_rapl
snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel intel_powerclamp
snd_hda_codec coretemp kvm_intel snd_hda_core kvm snd_hwdep snd_seq
snd_seq_device snd_pcm snd_timer irqbypass mei_me snd crct10dif_pclmul dell_wmi
crc32_pclmul
[  498.144270]  ghash_clmulni_intel pcbc iTCO_wdt dell_smbios
iTCO_vendor_support mei aesni_intel crypto_simd sparse_keymap wmi_bmof dcdbas
glue_helper cryptd shpchp soundcore nfsd sg i2c_i801 pcspkr acpi_pad
auth_rpcgss wmi nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod i915
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
e1000e ahci libahci ptp libata i2c_core pps_core crc32c_intel serio_raw video
dm_mirror dm_region_hash dm_log dm_mod
[  498.185999] CPU: 4 PID: 308 Comm: kworker/4:2 Tainted: G      D        
4.14.20-mss-pv5+ #13
[  498.194369] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.1.1
10/07/2015
[  498.201713] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  498.207673] task: ffff9e931c3e4740 task.stack: ffffb9ec413d4000
[  498.213546] RIP: 0010:set_task_cpu+0x184/0x190
[  498.217961] RSP: 0018:ffff9e9331d03cf8 EFLAGS: 00010046
[  498.223153] RAX: 0000000000000200 RBX: ffff9e93184b0000 RCX:
0000000000000001
[  498.230248] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
ffff9e93184b0000
[  498.237339] RBP: 0000000000022600 R08: 00000000000000ff R09:
0000000000000000
[  498.244427] R10: 000000005b0e7cd0 R11: 000000002276b083 R12:
0000000000000000
[  498.251507] R13: 0000000000000000 R14: 0000000000000046 R15:
0000000000000000
[  498.258588] FS:  0000000000000000(0000) GS:ffff9e9331d00000(0000)
knlGS:0000000000000000
[  498.266616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  498.272317] CR2: 0000000000000070 CR3: 00000001ee00a005 CR4:
00000000003606e0
[  498.279398] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  498.286478] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  498.293556] Call Trace:
[  498.295980]  <IRQ>
[  498.297974]  try_to_wake_up+0x161/0x440
[  498.301778]  __wake_up_common+0x8a/0x150
[  498.305669]  ep_poll_callback+0xc9/0x2e0
[  498.309560]  __wake_up_common+0x8a/0x150
[  498.313450]  __wake_up_common_lock+0x7a/0xc0
[  498.317686]  irq_work_run_list+0x48/0x70
[  498.321576]  ? tick_sched_do_timer+0x60/0x60
[  498.325813]  update_process_times+0x3b/0x50
[  498.329962]  tick_sched_handle+0x26/0x60
[  498.333853]  tick_sched_timer+0x34/0x70
[  498.337659]  __hrtimer_run_queues+0xdc/0x220
[  498.341896]  hrtimer_interrupt+0x99/0x190
[  498.345872]  smp_apic_timer_interrupt+0x5a/0x120
[  498.350452]  apic_timer_interrupt+0xa2/0xb0
[  498.354601]  </IRQ>
[  498.356683] RIP: 0010:panic+0x1fa/0x23c
[  498.360488] RSP: 0018:ffffb9ec413d7a10 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[  498.367999] RAX: 0000000000000034 RBX: 0000000000000000 RCX:
0000000000000006
[  498.375080] RDX: 0000000000000000 RSI: 0000000000000096 RDI:
ffff9e9331d16970
[  498.382161] RBP: ffffb9ec413d7a80 R08: 0000000000000000 R09:
0000000000000a27
[  498.389238] R10: 0000000000000000 R11: ffffb9ec413d7780 R12:
ffffffff88e44d50
[  498.396317] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000001
[  498.403396]  oops_end+0xaf/0xc0
[  498.406510]  no_context+0x1b3/0x400
[  498.409971]  __do_page_fault+0x97/0x4d0
[  498.413778]  do_page_fault+0x33/0x120
[  498.417410]  page_fault+0x2c/0x60
[  498.420706] RIP: 0010:reset_common_ring+0x23/0x140 [i915]
[  498.426063] RSP: 0018:ffffb9ec413d7c50 EFLAGS: 00010246
[  498.431245] RAX: ffffffffc03e9290 RBX: ffff9e9323745200 RCX:
0000000000000006
[  498.438326] RDX: 0000000000000000 RSI: ffff9e9323745200 RDI:
0000000000000000
[  498.445404] RBP: ffff9e9246426000 R08: 0000000000000000 R09:
00000000000009f9
[  498.452485] R10: 0000000000000007 R11: 0000000000000000 R12:
ffff9e9323745200
[  498.459579] R13: 00000000ffffffff R14: ffff9e931a09cdc8 R15:
ffff9e9246426000
[  498.466682]  ? port_assign+0x60/0x60 [i915]
[  498.470844]  i915_reset_engine+0x5e/0xf0 [i915]
[  498.475355]  i915_handle_error+0x21e/0x3e0 [i915]
[  498.480034]  ? vsnprintf+0x203/0x4d0
[  498.483582]  ? vscnprintf+0x9/0x20
[  498.486955]  ? scnprintf+0x49/0x70
[  498.490338]  hangcheck_declare_hang+0xd8/0x110 [i915]
[  498.495361]  ? fwtable_read32+0x83/0x190 [i915]
[  498.499864]  i915_hangcheck_elapsed+0x2cf/0x380 [i915]
[  498.504963]  process_one_work+0x141/0x340
[  498.508941]  worker_thread+0x47/0x3e0
[  498.512575]  kthread+0xfc/0x130
[  498.515690]  ? rescuer_thread+0x380/0x380
[  498.519667]  ? kthread_park+0x60/0x60
[  498.523301]  ? do_syscall_64+0x6f/0x1a0
[  498.527106]  ? SyS_exit_group+0x10/0x10
[  498.530912]  ret_from_fork+0x35/0x40
[  498.534458] Code: ff 80 8b ac 08 00 00 04 e9 2b ff ff ff 0f ff e9 c7 fe ff
ff f7 83 84 00 00 00 fd ff ff ff 0f 84 d1 fe ff ff 0f ff e9 ca fe ff ff <0f> ff
e9 d9 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 55 49 89
[  498.553187] ---[ end trace e7c5283a77cf3e18 ]---
[  498.557769] sched: Unexpected reschedule of offline CPU#0!

2. This kernel panic only reproduces on 4.14 LTS. It cannot be reproduced on
the latest drm-tip or kernel.org kernels.
I also ran git bisect on drm-tip to find the patch that fixes it; the git
bisect log and the fix patch are attached.
Please note that the labels in the bisect log are inverted:
"good" means the issue can be reproduced.
"bad" means the issue cannot be reproduced, i.e. the issue is fixed.

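Because the labels above are inverted relative to git's usual meaning, a
predicate script driven by git bisect run has to invert its exit status as
well. A minimal sketch, assuming a hypothetical helper named bisect_verdict and
an outcome string supplied by whoever ran the reproducer (neither is part of
the attached bisect log). git bisect run treats exit code 0 as "good", 125 as
"skip", and any other code from 1 to 127 as "bad":

```shell
#!/bin/sh
# Map a test outcome to a `git bisect run` exit code, using this report's
# inverted labels: "good" = panic reproduces, "bad" = panic is gone (fixed).
# bisect_verdict is a hypothetical helper, not part of the attached files.
bisect_verdict() {
    case "$1" in
        reproduced)     return 0 ;;    # still broken -> marked "good"
        not-reproduced) return 1 ;;    # fixed        -> marked "bad"
        *)              return 125 ;;  # build failed or unknown -> skip
    esac
}
```

git bisect start also accepts --term-old and --term-new (for example
--term-old=broken --term-new=fixed), which avoids the inverted reading of
"good" and "bad" altogether.
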
3. The commit message of the fix patch in drm-tip:
commit 221ab9719bf33ad2984928d2afb20988d652a289
Author: Chris Wilson <chris at chris-wilson.co.uk>
Date:   Sat Sep 16 21:44:14 2017 +0100

    drm/i915/execlists: Unwind incomplete requests on resets

    Given the mechanism to unwind and replay requests (designed to support
    preemption), we have an alternative to the current method of
    resubmitting the ELSP upon reset. Resubmitting ELSP turns out to be more
    complicated than expected, due to having to handle lost context-switch
    interrupts and so guessing what ELSP we need to resubmit later. Instead,
    by unwinding the requests and clearing the ELSP tracking entirely, we
    can then just dequeue the first pair of ready requests after resetting,
    using the normal submission procedure.

    Currently, the unwound requests have maximum priority and so are
    guaranteed to be resubmitted upon resume. If we are lucky, we may be
    able to coalesce a new request on top!

    Suggested-by: Michał Winiarski <michal.winiarski at intel.com>
    Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
    Cc: Michał Winiarski <michal.winiarski at intel.com>
    Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170916204414.32762-4-chris@chris-wilson.co.uk
    Reviewed-by: Michał Winiarski <michal.winiarski at intel.com>

4. The fix patch on kernel.org:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/gpu/drm/i915/intel_lrc.c?h=v4.16.13&id=221ab9719bf33ad2984928d2afb20988d652a289

5. Steps to reproduce:
        1) Build this stack:
https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
        2) Run the reproducer in one terminal:
        ./repro.sh
        3) Meanwhile, run the following command in another terminal:
        echo 15 > /sys/kernel/debug/dri/0/i915_wedged
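
The write in step 3 can be wrapped in a small guard so that a missing debugfs
mount or insufficient privileges fails with a clear message rather than a bare
redirection error. A minimal sketch: the function name trigger_wedged is
hypothetical, and the value is written unchanged, exactly as in step 3.

```shell
#!/bin/sh
# Write a value to the i915_wedged debugfs file after checking it is writable.
# trigger_wedged is a hypothetical helper; the path and the value 15 used in
# the usage line come from the reproduction steps.
trigger_wedged() {
    path=$1
    value=$2
    if [ ! -w "$path" ]; then
        echo "cannot write $path (is debugfs mounted? are you root?)" >&2
        return 1
    fi
    echo "$value" > "$path"
}

# Usage (as root, while ./repro.sh runs in another terminal):
#   trigger_wedged /sys/kernel/debug/dri/0/i915_wedged 15
```

On an affected 4.14 kernel this is expected to panic the machine, so it should
only be run on a test system.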
