[Intel-gfx] [PATCH] drm/i915: fix gpu hang vs. flip stall deadlocks
Daniel Vetter
daniel at ffwll.ch
Thu Sep 5 14:38:38 CEST 2013
On Wed, Sep 04, 2013 at 05:36:14PM +0200, Daniel Vetter wrote:
> Since we've started to clean up pending flips when the gpu hangs in
>
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrjala at linux.intel.com>
> Date: Mon Feb 18 19:08:49 2013 +0200
>
> drm/i915: Finish page flips and update primary planes after a GPU reset
>
> the gpu reset work now also grabs modeset locks. But since work
> items on our private work queue are not allowed to do that due to the
> flush_workqueue from the pageflip code this results in a neat
> deadlock:
>
> INFO: task kms_flip:14676 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kms_flip D ffff88019283a5c0 0 14676 13344 0x00000004
> ffff88018e62dbf8 0000000000000046 ffff88013bdb12e0 ffff88018e62dfd8
> ffff88018e62dfd8 00000000001d3b00 ffff88019283a5c0 ffff88018ec21000
> ffff88018f693f00 ffff88018eece000 ffff88018e62dd60 ffff88018eece898
> Call Trace:
> [<ffffffff8138ee7b>] schedule+0x60/0x62
> [<ffffffffa046c0dd>] intel_crtc_wait_for_pending_flips+0xb2/0x114 [i915]
> [<ffffffff81050ff4>] ? finish_wait+0x60/0x60
> [<ffffffffa0478041>] intel_crtc_set_config+0x7f3/0x81e [i915]
> [<ffffffffa031780a>] drm_mode_set_config_internal+0x4f/0xc6 [drm]
> [<ffffffffa0319cf3>] drm_mode_setcrtc+0x44d/0x4f9 [drm]
> [<ffffffff810e44da>] ? might_fault+0x38/0x86
> [<ffffffffa030d51f>] drm_ioctl+0x2f9/0x447 [drm]
> [<ffffffff8107a722>] ? trace_hardirqs_off+0xd/0xf
> [<ffffffffa03198a6>] ? drm_mode_setplane+0x343/0x343 [drm]
> [<ffffffff8112222f>] ? mntput_no_expire+0x3e/0x13d
> [<ffffffff81117f33>] vfs_ioctl+0x18/0x34
> [<ffffffff81118776>] do_vfs_ioctl+0x396/0x454
> [<ffffffff81396b37>] ? sysret_check+0x1b/0x56
> [<ffffffff81118886>] SyS_ioctl+0x52/0x7d
> [<ffffffff81396b12>] system_call_fastpath+0x16/0x1b
> 2 locks held by kms_flip/14676:
> #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa0316545>] drm_modeset_lock_all+0x22/0x59 [drm]
> #1: (&crtc->mutex){+.+.+.}, at: [<ffffffffa031656b>] drm_modeset_lock_all+0x48/0x59 [drm]
> INFO: task kworker/u8:4:175 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u8:4 D ffff88018de9a5c0 0 175 2 0x00000000
> Workqueue: i915 i915_error_work_func [i915]
> ffff88018e37dc30 0000000000000046 ffff8801938ab8a0 ffff88018e37dfd8
> ffff88018e37dfd8 00000000001d3b00 ffff88018de9a5c0 ffff88018ec21018
> 0000000000000246 ffff88018e37dca0 000000005a865a86 ffff88018de9a5c0
> Call Trace:
> [<ffffffff8138ee7b>] schedule+0x60/0x62
> [<ffffffff8138f23d>] schedule_preempt_disabled+0x9/0xb
> [<ffffffff8138d0cd>] mutex_lock_nested+0x205/0x3b1
> [<ffffffffa0477094>] ? intel_display_handle_reset+0x7e/0xbd [i915]
> [<ffffffffa0477094>] ? intel_display_handle_reset+0x7e/0xbd [i915]
> [<ffffffffa0477094>] intel_display_handle_reset+0x7e/0xbd [i915]
> [<ffffffffa044e0a2>] i915_error_work_func+0x128/0x147 [i915]
> [<ffffffff8104a89a>] process_one_work+0x1d4/0x35a
> [<ffffffff8104a821>] ? process_one_work+0x15b/0x35a
> [<ffffffff8104b4a5>] worker_thread+0x144/0x1f0
> [<ffffffff8104b361>] ? rescuer_thread+0x275/0x275
> [<ffffffff8105076d>] kthread+0xac/0xb4
> [<ffffffff81059d30>] ? finish_task_switch+0x3b/0xc0
> [<ffffffff810506c1>] ? __kthread_parkme+0x60/0x60
> [<ffffffff81396a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffff810506c1>] ? __kthread_parkme+0x60/0x60
> 3 locks held by kworker/u8:4/175:
> #0: (i915){.+.+.+}, at: [<ffffffff8104a821>] process_one_work+0x15b/0x35a
> #1: ((&dev_priv->gpu_error.work)){+.+.+.}, at: [<ffffffff8104a821>] process_one_work+0x15b/0x35a
> #2: (&crtc->mutex){+.+.+.}, at: [<ffffffffa0477094>] intel_display_handle_reset+0x7e/0xbd [i915]
>
> This blew up while running kms_flip/flip-vs-panning-vs-hang-interruptible
> on one of my older machines.
>
> Unfortunately (despite the proper lockdep annotations for
> flush_workqueue) lockdep still doesn't detect this correctly, so we
> need to rely on chance to discover these bugs.
>
> Apply the usual bugfix and schedule the reset work on the system
> workqueue to keep our own driver workqueue free of any modeset lock
> grabbing.
>
> Note that this is not a terribly serious regression since before the
> offending commit we'd simply have stalled userspace forever due to
> failing to abort all outstanding pageflips.
>
> v2: Add a comment as requested by Chris.
>
> Cc: Thomas Gleixner <tglx at linutronix.de>
> Cc: stable at vger.kernel.org
> Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
Picked up for -fixes with a bit of the spelling fail rectified and Chris'
irc r-b added (he still suffers from overly hungry mail-eating demons in
his cellar ...).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
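
For readers who want to see the shape of the deadlock described in the
commit message, here is a minimal userspace sketch. It is not the i915
code: two single-threaded executors stand in for the driver's private
workqueue and the system workqueue, a plain lock stands in for the modeset
locks, and a lock timeout stands in for "blocked forever". All names
(run_scenario, reset_work, LOCK_TIMEOUT, driver_wq, system_wq) are
hypothetical and chosen only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumption: a timeout models a task that would block forever in the kernel.
LOCK_TIMEOUT = 0.5


def run_scenario(reset_on_driver_queue: bool) -> bool:
    """Return True if the (modelled) reset work could take the modeset lock."""
    modeset_lock = threading.Lock()
    # One worker each, so queued items run strictly in submission order.
    driver_wq = ThreadPoolExecutor(max_workers=1)  # models the private driver queue
    system_wq = ThreadPoolExecutor(max_workers=1)  # models the system queue

    def reset_work() -> bool:
        # The reset path needs the modeset lock to clean up pending flips.
        got = modeset_lock.acquire(timeout=LOCK_TIMEOUT)
        if got:
            modeset_lock.release()
        return got

    # The modeset path holds the lock and then flushes the driver queue.
    with modeset_lock:
        target = driver_wq if reset_on_driver_queue else system_wq
        reset_future = target.submit(reset_work)
        # Flush: a no-op queued behind everything else finishes only once
        # all earlier items on the driver queue have finished.
        driver_wq.submit(lambda: None).result()

    result = reset_future.result()
    driver_wq.shutdown()
    system_wq.shutdown()
    return result


if __name__ == "__main__":
    print("reset on driver queue got the lock:", run_scenario(True))
    print("reset on system queue got the lock:", run_scenario(False))
```

With the reset work on the driver queue, the flush waits for the reset
work while the reset work waits for the lock held by the flusher, so the
lock attempt only ends because of the timeout. With the reset work on the
other queue, the flush has nothing to wait for, the lock is dropped, and
the reset work proceeds.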