[Bug 92545] New: [BSW] GPU Hang leads to sporadic kernel crashes

Mon Oct 19 12:27:10 PDT 2015

https://bugs.freedesktop.org/show_bug.cgi?id=92545

            Bug ID: 92545
           Summary: [BSW] GPU Hang leads to sporadic kernel crashes
           Product: DRI
           Version: XOrg git
          Hardware: x86-64 (AMD64)
                OS: other
            Status: NEW
          Severity: major
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: dhinakaran.pandiyan at intel.com
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

The kernel occasionally crashes after a GPU reset. GPU hangs are frequent and
happen at boot time, but only some of them lead to a kernel crash. This has
been observed on Chrome OS with a 3.18 kernel that has i915 backports.

I believe that the NULL pointer dereference happens at
I915_WRITE(DSPSURF(intel_crtc->plane), intel_crtc->unpin_work->gtt_offset);
in intel_display.c:ilk_do_mmio_flip.
If a reset is in progress at that point, the call sequence
intel_finish_reset -> intel_complete_page_flips -> intel_finish_page_flip_plane
-> do_intel_finish_page_flip -> page_flip_completed
might set intel_crtc->unpin_work = NULL before the MMIO flip work reads it.
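
To make the suspected ordering concrete, here is a minimal user-space sketch,
not driver code. The structs and function names (flip_work, crtc,
reset_completes_flip, do_mmio_flip_checked) are hypothetical stand-ins for the
i915 ones, and the demo replays the two steps in a single thread. It only
shows why the unguarded dereference would fault and what a NULL check at that
spot would look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the i915 structures; not the real layouts. */
struct flip_work {
    uint32_t gtt_offset;
};

struct crtc {
    struct flip_work *unpin_work; /* cleared when a flip is completed */
    uint32_t dspsurf;             /* stands in for the DSPSURF register */
};

/* Models page_flip_completed() being reached from the reset path. */
static void reset_completes_flip(struct crtc *crtc)
{
    crtc->unpin_work = NULL;
}

/*
 * Models the DSPSURF write with a NULL check added.
 * Returns 0 if the register was written, -1 if the flip was already
 * completed and the write was skipped.
 */
static int do_mmio_flip_checked(struct crtc *crtc)
{
    struct flip_work *work = crtc->unpin_work; /* read the pointer once */

    if (work == NULL)
        return -1;

    crtc->dspsurf = work->gtt_offset;
    return 0;
}

/* Replays the suspected ordering: flip, reset, then a late flip. */
static void flip_race_demo(void)
{
    struct flip_work work = { .gtt_offset = 0x1000 };
    struct crtc crtc = { .unpin_work = &work, .dspsurf = 0 };

    assert(do_mmio_flip_checked(&crtc) == 0);  /* normal flip */
    assert(crtc.dspsurf == 0x1000);

    reset_completes_flip(&crtc);               /* reset gets there first */
    assert(do_mmio_flip_checked(&crtc) == -1); /* skipped, no NULL deref */
    assert(crtc.dspsurf == 0x1000);            /* register left untouched */
}
```

Calling flip_race_demo() from a test program exercises both orderings. In the
real driver a NULL check alone would probably not be enough, since the pointer
could still be cleared between the check and the use unless both paths are
serialized by a lock; the sketch does not model that.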

We need some help to debug this crash.
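
The oops in the log below looks consistent with a NULL base pointer. The
bytes marked <8b> on the Code: line are 8b 50 48, which decode by hand as
"mov 0x48(%rax),%edx". RAX is 0 in the register dump and CR2 is 0x48. The
sketch below redoes that arithmetic with a tiny decoder for this one
instruction form. The helper names are made up for illustration, and mapping
offset 0x48 to a particular struct field is an assumption that the log alone
cannot confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded "mov r32, disp8(base)": opcode 0x8b with ModRM.mod == 01. */
struct mov_load {
    unsigned dst;  /* destination register number (ModRM.reg) */
    unsigned base; /* base register number (ModRM.rm) */
    int8_t disp;   /* signed 8-bit displacement */
};

/* Returns 0 on success, -1 for any form this sketch does not handle. */
static int decode_mov_load_disp8(const uint8_t *code, struct mov_load *out)
{
    uint8_t modrm;

    if (code[0] != 0x8b)
        return -1;
    modrm = code[1];
    if ((modrm >> 6) != 1) /* only base + disp8 addressing */
        return -1;
    if ((modrm & 7) == 4)  /* rm == 100 means a SIB byte follows */
        return -1;

    out->dst = (modrm >> 3) & 7;
    out->base = modrm & 7;
    out->disp = (int8_t)code[2];
    return 0;
}

/* Recomputes the faulting address from values copied out of the oops. */
static uint64_t oops_fault_address(void)
{
    static const uint8_t faulting[] = { 0x8b, 0x50, 0x48 }; /* at <8b> */
    const uint64_t rax = 0x0; /* RAX from the register dump */
    struct mov_load insn;
    int ret = decode_mov_load_disp8(faulting, &insn);

    assert(ret == 0);
    assert(insn.base == 0); /* register 0 is RAX */
    assert(insn.dst == 2);  /* register 2 is EDX */
    (void)ret;

    return rax + (uint64_t)(int64_t)insn.disp;
}
```

oops_fault_address() returns 0x48, which matches CR2. The instruction just
before it in the Code: bytes, 48 8b 83 48 07 00 00, decodes by hand as
"mov 0x748(%rbx),%rax", so RAX appears to have been loaded from a field of the
object in RBX and was NULL when it was used as a base.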

<6>[    6.744129] [drm] stuck on render ring
<6>[    6.766343] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung,
action: reset
<6>[    6.766356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx
stack, including userspace.
<6>[    6.766367] [drm] Please file a _new_ bug report on bugs.freedesktop.org
against DRI -> DRM/Intel
<6>[    6.766378] [drm] drm/i915 developers can then reassign to the right
component if it's not a kernel issue.
<6>[    6.766389] [drm] The gpu crash dump is required to analyze gpu hangs, so
please always attach it.
<6>[    6.766400] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>[    6.769207] drm/i915: Resetting chip after gpu hang
<6>[   12.739947] [drm] stuck on render ring
<6>[   12.765654] [drm] GPU HANG: ecode 8:0:0x86dffffd, in chrome [3652],
reason: Ring hung, action: reset
<4>[   12.765733] ------------[ cut here ]------------
<4>[   12.765764] WARNING: CPU: 2 PID: 41 at
/mnt/host/source/src/third_party/kernel/v3.18/drivers/gpu/drm/i915/intel_display.c:11277
intel_mmio_flip_work_func+0x6d/0x315()
<4>[   12.765787] WARN_ON(__i915_wait_request(mmio_flip->req,
mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
<4>[   12.765805] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
cros_ec_sensors ip6table_filter cros_ec_sensors_core
industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio
iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx
usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
ppp_async ppp_generic slhc tun
<4>[   12.765930] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted
3.18.0-06623-g902cb99 #1
<4>[   12.765944] Hardware name: GOOGLE Cyan, BIOS
Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   12.765962] Workqueue: events intel_mmio_flip_work_func
<4>[   12.765974]  0000000000000000 000000004607d413 ffff88017a9bbcc8
ffffffff8d5f3d15
<4>[   12.765996]  0000000000000000 ffff88017a9bbd20 ffff88017a9bbd08
ffffffff8d03dfd9
<4>[   12.766016]  ffff88017a9bbcd8 ffffffff8d345524 ffff88017a97f000
ffff880072f2da00
<4>[   12.766037] Call Trace:
<4>[   12.766054]  [<ffffffff8d5f3d15>] ? dump_stack+0x46/0x58
<4>[   12.766070]  [<ffffffff8d03dfd9>] ? warn_slowpath_common+0x81/0x9b
<4>[   12.766085]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   12.766100]  [<ffffffff8d03e048>] ? warn_slowpath_fmt+0x55/0x6b
<4>[   12.766115]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   12.766133]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   12.766149]  [<ffffffff8d051a1b>] ? process_one_work+0x175/0x2ab
<4>[   12.766163]  [<ffffffff8d052c95>] ? worker_thread+0x1fb/0x2ce
<4>[   12.766178]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   12.766192]  [<ffffffff8d056863>] ? kthread+0x10e/0x116
<4>[   12.766207]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   12.766222]  [<ffffffff8d5f8bac>] ? ret_from_fork+0x7c/0xb0
<4>[   12.766237]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   12.766249] ---[ end trace 8d614c29c562a829 ]---
<5>[   12.767790] drm/i915: Resetting chip after gpu hang
<6>[   18.740012] [drm] stuck on render ring
<6>[   18.760304] [drm] GPU HANG: ecode 8:0:0x86dffffd, in chrome [3652],
reason: Ring hung, action: reset
<4>[   18.760635] ------------[ cut here ]------------
<4>[   18.760665] WARNING: CPU: 0 PID: 1099 at
/mnt/host/source/src/third_party/kernel/v3.18/drivers/gpu/drm/i915/intel_display.c:11277
intel_mmio_flip_work_func+0x6d/0x315()
<4>[   18.760688] WARN_ON(__i915_wait_request(mmio_flip->req,
mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
<4>[   18.760706] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
cros_ec_sensors ip6table_filter cros_ec_sensors_core
industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio
iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx
usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
ppp_async ppp_generic slhc tun
<4>[   18.760823] CPU: 0 PID: 1099 Comm: kworker/0:2 Tainted: G        W     
3.18.0-06623-g902cb99 #1
<4>[   18.760837] Hardware name: GOOGLE Cyan, BIOS
Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   18.760854] Workqueue: events intel_mmio_flip_work_func
<4>[   18.760866]  0000000000000000 00000000e8ee967d ffff8801760b3cc8
ffffffff8d5f3d15
<4>[   18.760886]  0000000000000000 ffff8801760b3d20 ffff8801760b3d08
ffffffff8d03dfd9
<4>[   18.760905]  ffff8801760b3cd8 ffffffff8d345524 ffff8801799ceb40
ffff8801798a50c0
<4>[   18.760925] Call Trace:
<4>[   18.760940]  [<ffffffff8d5f3d15>] ? dump_stack+0x46/0x58
<4>[   18.760955]  [<ffffffff8d03dfd9>] ? warn_slowpath_common+0x81/0x9b
<4>[   18.760969]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   18.760983]  [<ffffffff8d03e048>] ? warn_slowpath_fmt+0x55/0x6b
<4>[   18.760997]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   18.761014]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   18.761028]  [<ffffffff8d051a1b>] ? process_one_work+0x175/0x2ab
<4>[   18.761042]  [<ffffffff8d052c95>] ? worker_thread+0x1fb/0x2ce
<4>[   18.761055]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   18.761069]  [<ffffffff8d056863>] ? kthread+0x10e/0x116
<4>[   18.761083]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.761096]  [<ffffffff8d5f8bac>] ? ret_from_fork+0x7c/0xb0
<4>[   18.761110]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.761121] ---[ end trace 8d614c29c562a82a ]---
<5>[   18.763443] drm/i915: Resetting chip after gpu hang
<1>[   18.769490] BUG: unable to handle kernel NULL pointer dereference at
0000000000000048
<1>[   18.769515] IP: [<ffffffff8d345716>]
intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.769536] PGD 0 
<4>[   18.769544] Oops: 0000 [#1] SMP 
<0>[   18.773130] gsmi: Log Shutdown Reason 0x03
<4>[   18.773140] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
cros_ec_sensors ip6table_filter cros_ec_sensors_core
industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio
iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx
usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev
ppp_async ppp_generic slhc tun
<4>[   18.773241] CPU: 0 PID: 1099 Comm: kworker/0:2 Tainted: G        W     
3.18.0-06623-g902cb99 #1
<4>[   18.773255] Hardware name: GOOGLE Cyan, BIOS
Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   18.773272] Workqueue: events intel_mmio_flip_work_func
<4>[   18.773283] task: ffff880179bfea80 ti: ffff8801760b0000 task.ti:
ffff8801760b0000
<4>[   18.773296] RIP: 0010:[<ffffffff8d345716>]  [<ffffffff8d345716>]
intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.773314] RSP: 0018:ffff8801760b3d88  EFLAGS: 00010096
<4>[   18.773324] RAX: 0000000000000000 RBX: ffff88017b2b7000 RCX:
0000000000180000
<4>[   18.773337] RDX: 00000000001e1180 RSI: 0000000000000046 RDI:
ffff88017a080000
<4>[   18.773349] RBP: ffff8801760b3de8 R08: 0000000000000001 R09:
ffff88017b2b7000
<4>[   18.773361] R10: 0000000000000000 R11: 000000000000b910 R12:
ffff88017a080000
<4>[   18.773373] R13: 00000000001f0180 R14: ffff8801798a50c0 R15:
ffff8801741c5680
<4>[   18.773386] FS:  0000000000000000(0000) GS:ffff88017fc00000(0000)
knlGS:0000000000000000
<4>[   18.773399] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[   18.773410] CR2: 0000000000000048 CR3: 0000000077e06000 CR4:
00000000001007f0
<4>[   18.773422] Stack:
<4>[   18.773427]  ffff8801760b3db8 ffffffff8d05c849 ffff88017a99c000
ffff880078ac0b40
<4>[   18.773446]  000003dc77f04c10 00000000e8ee967d ffff8801760b3e28
ffff8801799ceb40
<4>[   18.773463]  ffff8801798a50c0 ffff88017fc11780 0000000000000000
ffff88017fc15b00
<4>[   18.773481] Call Trace:
<4>[   18.773495]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   18.773510]  [<ffffffff8d051a1b>] process_one_work+0x175/0x2ab
<4>[   18.773523]  [<ffffffff8d052c95>] worker_thread+0x1fb/0x2ce
<4>[   18.773535]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   18.773548]  [<ffffffff8d056863>] kthread+0x10e/0x116
<4>[   18.773561]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.773575]  [<ffffffff8d5f8bac>] ret_from_fork+0x7c/0xb0
<4>[   18.773587]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.773597] Code: 00 c0 74 05 80 cc 04 89 c2 b9 01 00 00 00 4c 89 ee 4c
89 e7 41 ff 94 24 d8 00 00 00 48 8b 83 48 07 00 00 41 8b 4c 24 20 4c 89 e7 <8b>
50 48 8b 83 24 04 00 00 41 8b 44 84 30 41 2b 44 24 30 8d b4 
<1>[   18.773724] RIP  [<ffffffff8d345716>]
intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.773739]  RSP <ffff8801760b3d88>
<4>[   18.773747] CR2: 0000000000000048
<4>[   18.773756] ---[ end trace 8d614c29c562a82b ]---
<0>[   18.781967] Kernel panic - not syncing: Fatal exception
<0>[   18.782089] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[   18.782293] gsmi: Log Shutdown Reason 0x02
