[Intel-gfx] [PATCH] drm/i915: Prevent NULL after failed PPGTT

Daniel Vetter daniel at ffwll.ch
Fri Nov 15 10:02:20 CET 2013


On Thu, Nov 14, 2013 at 05:01:44PM -0800, Ben Widawsky wrote:
> If an object was bound in the ppgtt, and we do a GPU reset, but the
> PPGTT was not brought back up on reset, trying to unbind the object
> later will result in a NULL ptr. Ideally this (failed PPGTT) should
> never happen, but it is allowed in the code, and therefore we should
> prevent the OOPS.
> 
> Since Broadwell hang/reset handling is still under development, and
> apparently so is aliasing PPGTT after reset, this helps alleviate some
> of the pain.
> 
> NOTE: With the coming PPGTT patches this can't ever occur, since if
> PPGTT is supposed to come up and doesn't, the driver will fail to load
> (since it will make context loading fail).
> 
> Here is an example splat:
> 
> [  588.795571] ---[ end trace f23239922ecdffbc ]---
> [  598.427072] [drm] stuck on render ring
> [  598.473116] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [  598.550946] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> [  598.663996] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> [  598.772830] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> [  598.891172] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> [  599.006668] [drm] Simulated gpu hang, resetting stop_rings
> [  599.084004] [drm:__gen6_gt_force_wake_mt_get] *ERROR* Timed out waiting for forcewake old ack to clear.
> [  599.204258] [drm] PPGTT enable failed. This is not fatal, but unexpected
> [  599.287287] BUG: unable to handle kernel NULL pointer dereference at 0000000000000108
> [  599.389563] IP: [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [  599.484426] PGD 34ab6067 PUD 50b2a067 PMD 0
> [  599.542964] Oops: 0000 [#1] PREEMPT SMP
> [  599.597171] Modules linked in: i915 drm_kms_helper drm intel_gtt agpgart i2c_algo_bit i2c_core netconsole configfs ext4 x86_pkg_temp_thermal coretemp crc16 mbcache kvm_intel jbd2 kvm ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode serio_raw evdev thermal fan battery e1000e acpi_cpufreq video ptp button pps_core ac processor snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer snd soundcore hid_generic usbhid hid btrfs libcrc32c xor raid6_pq sd_mod ehci_pci ehci_hcd ahci libahci crc32c_intel libata usbcore scsi_mod usb_common
> [  600.268119] CPU: 0 PID: 2612 Comm: kms_flip Tainted: G        W    3.12.0-BEN+ #38
> [  600.366889] Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1, BIOS BDW-E1R1.86C.0048.R02.1310291000 10/29/2013
> [  600.527990] task: ffff88009d28b0c0 ti: ffff88003b5d4000 task.ti: ffff88003b5d4000
> [  600.626121] RIP: 0010:[<ffffffffa0559794>]  [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [  600.750986] RSP: 0018:ffff88003b5d5ca0  EFLAGS: 00010202
> [  600.822380] RAX: 0000000000000004 RBX: ffff88008eac5a40 RCX: 00000000000000fe
> [  600.916281] RDX: 0000000000000000 RSI: ffff88008eac5a40 RDI: ffff88008eac5a40
> [  601.010181] RBP: ffff88003b5d5cb8 R08: 0000000000000000 R09: 0000000000000000
> [  601.104083] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  601.197983] R13: ffff88008eac5b30 R14: ffff880145208000 R15: ffff88008eac5ac8
> [  601.291686] FS:  00007fabc555a8c0(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> [  601.397339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  601.474153] CR2: 0000000000000108 CR3: 000000008eadd000 CR4: 00000000003407f0
> [  601.568035] Stack:
> [  601.599051]  ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30 ffff88003b5d5cf0
> [  601.696409]  ffffffffa055003f ffff88008eac5a40 ffff88005cb96000 ffff88008eac5b30
> [  601.793636]  ffff880145208000 ffff88008eac5ac8 ffff88003b5d5d30 ffffffffa05513fe
> [  601.890889] Call Trace:
> [  601.927323]  [<ffffffffa055003f>] i915_vma_unbind+0x28f/0x340 [i915]
> [  602.011520]  [<ffffffffa05513fe>] i915_gem_free_object+0x9e/0x340 [i915]
> [  602.100135]  [<ffffffff810b81cd>] ? trace_hardirqs_on+0xd/0x10
> [  602.178010]  [<ffffffffa04c248a>] drm_gem_object_free+0x2a/0x30 [drm]
> [  602.263249]  [<ffffffffa04c29fa>] drm_gem_object_handle_unreference_unlocked+0x11a/0x130 [drm]
> [  602.375214]  [<ffffffffa04c2ae6>] drm_gem_handle_delete+0xd6/0x1d0 [drm]
> [  602.463759]  [<ffffffffa04c3358>] drm_gem_close_ioctl+0x28/0x30 [drm]
> [  602.549031]  [<ffffffffa04c0d92>] drm_ioctl+0x502/0x640 [drm]
> [  602.625820]  [<ffffffff8115ac70>] ? might_fault+0xa0/0xb0
> [  602.698152]  [<ffffffff8115ac27>] ? might_fault+0x57/0xb0
> [  602.770831]  [<ffffffff8100f0ec>] ? __restore_xstate_sig+0x13c/0x600
> [  602.855035]  [<ffffffff811bb6c5>] do_vfs_ioctl+0x305/0x530
> [  602.928680]  [<ffffffff811c73a7>] ? fget_light+0x387/0x4f0
> [  603.001415]  [<ffffffff811bb971>] SyS_ioctl+0x81/0xa0
> [  603.069506]  [<ffffffff814dd6d6>] system_call_fastpath+0x1a/0x1f
> [  603.148332] Code: 89 e7 89 c2 41 ff d6 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 48 89 f7 53 <4d> 8b ac 24 08 01 00 00 48 8b 56 08 48 8b 9e b8 00 00 00 48 8b
> [  603.392564] RIP  [<ffffffffa0559794>] i915_ppgtt_unbind_object+0x14/0x60 [i915]
> [  603.487472]  RSP <ffff88003b5d5ca0>
> [  603.535732] CR2: 0000000000000108
> [  603.622175] ---[ end trace f23239922ecdffbd ]---
> 
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40d9dcf..2b9245d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2741,6 +2741,13 @@ int i915_vma_unbind(struct i915_vma *vma)
>  
>  	if (obj->has_global_gtt_mapping)
>  		i915_gem_gtt_unbind_object(obj);
> +
> +	if (unlikely(!dev_priv->mm.aliasing_ppgtt &&
> +		     obj->has_aliasing_ppgtt_mapping)) {
> +		DRM_DEBUG_DRIVER("Leftover PPGTT mapping after reset\n");
> +		obj->has_aliasing_ppgtt_mapping = 0;
> +	}

Nack on sprinkling band-aids all over the place when imo the real bug is
that our init_hw functions destroy sw tracking state behind our back:

http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg27025.html

If we rip out the i915_gem_cleanup_aliasing_ppgtt call, we shouldn't blow
up, but can keep on going (without the gt, of course).
-Daniel
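
To make the argument concrete, here is a minimal userspace sketch of the two
behaviours. It is not the i915 code: every toy_* name, the struct layouts and
the cleanup_on_failure switch are made up for illustration. It only assumes
what the thread describes, namely that an object carries a "has aliasing
ppgtt mapping" flag, that the unbind path trusts that flag, and that a failed
PPGTT enable during reset can drop the software ppgtt pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real i915 structures. */
struct toy_ppgtt { int unbind_calls; };
struct toy_obj   { bool has_aliasing_ppgtt_mapping; };
struct toy_dev {
	struct toy_ppgtt *aliasing_ppgtt;  /* software tracking state */
	struct toy_ppgtt  storage;         /* backing storage for the toy */
};

/* Like the real unbind helper, this dereferences the ppgtt unconditionally. */
static void toy_ppgtt_unbind(struct toy_ppgtt *ppgtt, struct toy_obj *obj)
{
	(void)obj;
	assert(ppgtt != NULL);
	ppgtt->unbind_calls++;
}

/*
 * Toy reset path. With cleanup_on_failure set, a failed hardware enable
 * also throws away the software pointer, while objects keep their flag.
 * With it clear, the failure leaves the tracking state untouched.
 */
static void toy_reset(struct toy_dev *dev, bool hw_enable_ok,
		      bool cleanup_on_failure)
{
	if (!hw_enable_ok && cleanup_on_failure)
		dev->aliasing_ppgtt = NULL;
}

/* Returns 0 on success, -1 where the real code would have oopsed. */
static int toy_vma_unbind(struct toy_dev *dev, struct toy_obj *obj)
{
	if (obj->has_aliasing_ppgtt_mapping) {
		if (dev->aliasing_ppgtt == NULL)
			return -1;  /* flag and pointer disagree */
		toy_ppgtt_unbind(dev->aliasing_ppgtt, obj);
		obj->has_aliasing_ppgtt_mapping = false;
	}
	return 0;
}

/* Bind an object, fail the PPGTT enable on reset, then unbind the object. */
int toy_run(bool cleanup_on_failure)
{
	struct toy_dev dev = { .aliasing_ppgtt = NULL, .storage = { 0 } };
	struct toy_obj obj = { .has_aliasing_ppgtt_mapping = true };

	dev.aliasing_ppgtt = &dev.storage;
	toy_reset(&dev, false, cleanup_on_failure);
	return toy_vma_unbind(&dev, &obj);
}
```

In this toy, toy_run(true) hits the inconsistent state (flag set, pointer
gone) and returns -1, while toy_run(false) unbinds normally and returns 0.
The extra check in toy_vma_unbind plays the role of the proposed band-aid;
leaving the pointer alone in toy_reset makes that check unnecessary.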

> +
>  	if (obj->has_aliasing_ppgtt_mapping) {
>  		i915_ppgtt_unbind_object(dev_priv->mm.aliasing_ppgtt, obj);
>  		obj->has_aliasing_ppgtt_mapping = 0;
> -- 
> 1.8.4.2
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


