[Intel-gfx] [PATCH] drm/i915: Prevent NULL after failed PPGTT
Daniel Vetter
daniel at ffwll.ch
Fri Nov 15 10:02:20 CET 2013
On Thu, Nov 14, 2013 at 05:01:44PM -0800, Ben Widawsky wrote:
> If an object was bound in the ppgtt, and we do a GPU reset, but the
> PPGTT was not brought back up on reset, trying to unbind the object
> later will result in a NULL ptr. Ideally this (failed PPGTT) should
> never happen, but it is allowed in the code, and therefore we should
> prevent the OOPS.
>
> Since Broadwell hang/reset handling is still under development, and apparently
> so is aliasing PPGTT after reset, this helps alleviate some of the pain.
>
> NOTE: With the coming PPGTT patches this can't ever occur, since if
> PPGTT is supposed to come up and doesn't, the driver will fail to load
> (since it will make context loading fail).
>
> Here is an example splat:
>
> [ 588.795571] ---[ end trace f23239922ecdffbc ]---
> [ 598.427072] [drm] stuck on render ring
> [ 598.473116] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [ 598.550946] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> [ 598.663996] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> [ 598.772830] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> [ 598.891172] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> [ 599.006668] [drm] Simulated gpu hang, resetting stop_rings
> [ 599.084004] [drm:__gen6_gt_force_wake_mt_get] *ERROR* Timed out waiting for forcewake old ack to clear.
> [ 599.204258] [drm] PPGTT enable failed. This is not fatal, but unexpected
> [ 599.287287] BUG: unable to handle kernel NULL pointer dereference at 0000000000000108
> [ 599.389563] IP: [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [ 599.484426] PGD 34ab6067 PUD 50b2a067 PMD 0
> [ 599.542964] Oops: 0000 [#1] PREEMPT SMP
> [ 599.597171] Modules linked in: i915 drm_kms_helper drm intel_gtt agpgart i2c_algo_bit i2c_core netconsole configfs ext4 x86_pkg_temp_thermal coretemp crc16 mbcache kvm_intel jbd2 kvm ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode serio_raw evdev thermal fan battery e1000e acpi_cpufreq video ptp button pps_core ac processor snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer snd soundcore hid_generic usbhid hid btrfs libcrc32c xor raid6_pq sd_mod ehci_pci ehci_hcd ahci libahci crc32c_intel libata usbcore scsi_mod usb_common
> [ 600.268119] CPU: 0 PID: 2612 Comm: kms_flip Tainted: G W 3.12.0-BEN+ #38
> [ 600.366889] Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1, BIOS BDW-E1R1.86C.0048.R02.1310291000 10/29/2013
> [ 600.527990] task: ffff88009d28b0c0 ti: ffff88003b5d4000 task.ti: ffff88003b5d4000
> [ 600.626121] RIP: 0010:[<ffffffffa0559794>] [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [ 600.750986] RSP: 0018:ffff88003b5d5ca0 EFLAGS: 00010202
> [ 600.822380] RAX: 0000000000000004 RBX: ffff88008eac5a40 RCX: 00000000000000fe
> [ 600.916281] RDX: 0000000000000000 RSI: ffff88008eac5a40 RDI: ffff88008eac5a40
> [ 601.010181] RBP: ffff88003b5d5cb8 R08: 0000000000000000 R09: 0000000000000000
> [ 601.104083] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 601.197983] R13: ffff88008eac5b30 R14: ffff880145208000 R15: ffff88008eac5ac8
> [ 601.291686] FS: 00007fabc555a8c0(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> [ 601.397339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 601.474153] CR2: 0000000000000108 CR3: 000000008eadd000 CR4: 00000000003407f0
> [ 601.568035] Stack:
> [ 601.599051] ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30 ffff88003b5d5cf0
> [ 601.696409] ffffffffa055003f ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30
> [ 601.793636] ffff880145208000 ffff88008eac5ac8 ffff88003b5d5d30 ffffffffa05513fe
> [ 601.890889] Call Trace:
> [ 601.927323] [<ffffffffa055003f>] i915_vma_unbind+0x28f/0x340 [i915]
> [ 602.011520] [<ffffffffa05513fe>] i915_gem_free_object+0x9e/0x340 [i915]
> [ 602.100135] [<ffffffff810b81cd>] ? trace_hardirqs_on+0xd/0x10
> [ 602.178010] [<ffffffffa04c248a>] drm_gem_object_free+0x2a/0x30 [drm]
> [ 602.263249] [<ffffffffa04c29fa>] drm_gem_object_handle_unreference_unlocked+0x11a/0x130 [drm]
> [ 602.375214] [<ffffffffa04c2ae6>] drm_gem_handle_delete+0xd6/0x1d0 [drm]
> [ 602.463759] [<ffffffffa04c3358>] drm_gem_close_ioctl+0x28/0x30 [drm]
> [ 602.549031] [<ffffffffa04c0d92>] drm_ioctl+0x502/0x640 [drm]
> [ 602.625820] [<ffffffff8115ac70>] ? might_fault+0xa0/0xb0
> [ 602.698152] [<ffffffff8115ac27>] ? might_fault+0x57/0xb0
> [ 602.770831] [<ffffffff8100f0ec>] ? __restore_xstate_sig+0x13c/0x600
> [ 602.855035] [<ffffffff811bb6c5>] do_vfs_ioctl+0x305/0x530
> [ 602.928680] [<ffffffff811c73a7>] ? fget_light+0x387/0x4f0
> [ 603.001415] [<ffffffff811bb971>] SyS_ioctl+0x81/0xa0
> [ 603.069506] [<ffffffff814dd6d6>] system_call_fastpath+0x1a/0x1f
> [ 603.148332] Code: 89 e7 89 c2 41 ff d6 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 48 89 f7 53 <4d> 8b ac 24 08 01 00 00 48 8b 56 08 48 8b 9e b8 00 00 00 48 8b
> [ 603.392564] RIP [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [ 603.487472] RSP <ffff88003b5d5ca0>
> [ 603.535732] CR2: 0000000000000108
> [ 603.622175] ---[ end trace f23239922ecdffbd ]---
>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
> drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40d9dcf..2b9245d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2741,6 +2741,13 @@ int i915_vma_unbind(struct i915_vma *vma)
>
>  	if (obj->has_global_gtt_mapping)
>  		i915_gem_gtt_unbind_object(obj);
> +
> +	if (unlikely(!dev_priv->mm.aliasing_ppgtt &&
> +		     obj->has_aliasing_ppgtt_mapping)) {
> +		DRM_DEBUG_DRIVER("Leftover PPGTT mapping after reset\n");
> +		obj->has_aliasing_ppgtt_mapping = 0;
> +	}
Nack on sprinkling band-aids all over the place when imo the real bug is
that our init_hw functions destroy sw tracking state behind our back:
http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg27025.html
If we rip out the i915_gem_cleanup_aliasing_ppgtt we shouldn't blow up,
but can keep on going (without the gt ofc).
-Daniel
> +
>  	if (obj->has_aliasing_ppgtt_mapping) {
>  		i915_ppgtt_unbind_object(dev_priv->mm.aliasing_ppgtt, obj);
>  		obj->has_aliasing_ppgtt_mapping = 0;
> --
> 1.8.4.2
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch