[Intel-gfx] [PATCH] drm/i915: Verify the engine after acquiring the active.lock

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Sep 17 14:59:25 UTC 2019


On 16/09/2019 12:38, Chris Wilson wrote:
> When using virtual engines, the rq->engine is not stable until we hold
> the engine->active.lock (as the virtual engine may be exchanged with the
> sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy"),
> a request may be retired concurrently with its resubmission to HW, so we
> need to be extra careful to verify that we are holding the correct lock
> for the request's active list. This is similar to the issue we saw with
> rescheduling the virtual requests, see sched_lock_engine().
> 
> Or else:
> 
> <4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610).
> <4> [876.736136] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70
> <4> [876.736137] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
> <4> [876.736154] CPU: 2 PID: 21 Comm: ksoftirqd/2 Tainted: G     U            5.3.0-CI-CI_DRM_6898+ #1
> <4> [876.736156] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019
> <4> [876.736157] RIP: 0010:__list_add_valid+0x4d/0x70
> <4> [876.736159] Code: c3 48 89 d1 48 c7 c7 20 33 0e 82 48 89 c2 e8 4a 4a bc ff 0f 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 70 33 0e 82 e8 33 4a bc ff <0f> 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 c0 33 0e 82 e8
> <4> [876.736160] RSP: 0018:ffffc9000018bd30 EFLAGS: 00010082
> <4> [876.736162] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000104
> <4> [876.736163] RDX: 0000000080000104 RSI: 0000000000000000 RDI: 00000000ffffffff
> <4> [876.736164] RBP: ffffc9000018bd68 R08: 0000000000000000 R09: 0000000000000001
> <4> [876.736165] R10: 00000000aed95de3 R11: 000000007fe927eb R12: ffff888361ffca10
> <4> [876.736166] R13: ffff888361ffa610 R14: ffff888361ffc880 R15: ffff8883f931a1f8
> <4> [876.736168] FS:  0000000000000000(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
> <4> [876.736169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [876.736170] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0
> <4> [876.736171] PKRU: 55555554
> <4> [876.736172] Call Trace:
> <4> [876.736226]  __i915_request_submit+0x152/0x370 [i915]
> <4> [876.736263]  __execlists_submission_tasklet+0x6da/0x1f50 [i915]
> <4> [876.736293]  ? execlists_submission_tasklet+0x29/0x50 [i915]
> <4> [876.736321]  execlists_submission_tasklet+0x34/0x50 [i915]
> <4> [876.736325]  tasklet_action_common.isra.5+0x47/0xb0
> <4> [876.736328]  __do_softirq+0xd8/0x4ae
> <4> [876.736332]  ? smpboot_thread_fn+0x23/0x280
> <4> [876.736334]  ? smpboot_thread_fn+0x6b/0x280
> <4> [876.736336]  run_ksoftirqd+0x2b/0x50
> <4> [876.736338]  smpboot_thread_fn+0x1d3/0x280
> <4> [876.736341]  ? sort_range+0x20/0x20
> <4> [876.736343]  kthread+0x119/0x130
> <4> [876.736345]  ? kthread_park+0xa0/0xa0
> <4> [876.736347]  ret_from_fork+0x24/0x50
> <4> [876.736353] irq event stamp: 2290145
> <4> [876.736356] hardirqs last  enabled at (2290144): [<ffffffff8123cde8>] __slab_free+0x3e8/0x500
> <4> [876.736358] hardirqs last disabled at (2290145): [<ffffffff819cfb4d>] _raw_spin_lock_irqsave+0xd/0x50
> <4> [876.736360] softirqs last  enabled at (2290114): [<ffffffff81c0033e>] __do_softirq+0x33e/0x4ae
> <4> [876.736361] softirqs last disabled at (2290119): [<ffffffff810b815b>] run_ksoftirqd+0x2b/0x50
> <4> [876.736363] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70
> <4> [876.736364] ---[ end trace 3e58d6c7356c65bf ]---
> <4> [876.736406] ------------[ cut here ]------------
> <4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730
> <4> [876.736421] WARNING: CPU: 2 PID: 5490 at lib/list_debug.c:53 __list_del_entry_valid+0x79/0x90
> <4> [876.736422] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
> <4> [876.736433] CPU: 2 PID: 5490 Comm: i915_selftest Tainted: G     U  W         5.3.0-CI-CI_DRM_6898+ #1
> <4> [876.736435] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019
> <4> [876.736436] RIP: 0010:__list_del_entry_valid+0x79/0x90
> <4> [876.736438] Code: 0b 31 c0 c3 48 89 fe 48 c7 c7 30 34 0e 82 e8 ae 49 bc ff 0f 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 68 34 0e 82 e8 97 49 bc ff <0f> 0b 31 c0 c3 48 c7 c7 a8 34 0e 82 e8 86 49 bc ff 0f 0b 31 c0 c3
> <4> [876.736439] RSP: 0018:ffffc900003ef758 EFLAGS: 00010086
> <4> [876.736440] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000002
> <4> [876.736442] RDX: 0000000080000002 RSI: 0000000000000000 RDI: 00000000ffffffff
> <4> [876.736443] RBP: ffffc900003ef780 R08: 0000000000000000 R09: 0000000000000001
> <4> [876.736444] R10: 000000001418e4b7 R11: 000000007f0ea93b R12: ffff888361ffcab8
> <4> [876.736445] R13: ffff88843b6d0000 R14: 000000000000217c R15: 0000000000000001
> <4> [876.736447] FS:  00007f4e6f255240(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
> <4> [876.736448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [876.736449] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0
> <4> [876.736450] PKRU: 55555554
> <4> [876.736451] Call Trace:
> <4> [876.736488]  i915_request_retire+0x224/0x8e0 [i915]
> <4> [876.736521]  i915_request_create+0x4b/0x1b0 [i915]
> <4> [876.736550]  nop_virtual_engine+0x230/0x4d0 [i915]
> 
> Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111695
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> ---
>   drivers/gpu/drm/i915/i915_request.c | 25 ++++++++++++++++++++++---
>   1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 754a78364a63..f12358150097 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -195,6 +195,27 @@ static void free_capture_list(struct i915_request *request)
>   	}
>   }
>   
> +static void remove_from_engine(struct i915_request *rq)
> +{
> +	struct intel_engine_cs *engine, *locked;
> +
> +	/*
> +	 * Virtual engines complicate acquiring the engine timeline lock,
> +	 * as their rq->engine pointer is not stable until under that
> +	 * engine lock. The simple ploy we use is to take the lock then
> +	 * check that the rq still belongs to the newly locked engine.
> +	 */
> +	locked = READ_ONCE(rq->engine);

I thought we had rq->owner, so we could bail out early if it pointed to a 
non-virtual engine, but that field seems like a distant memory. Still, it 
could be worth storing the original/fixed engine in there somewhere. Maybe 
it would enable some interesting asserts. Or maybe not..

> +	spin_lock(&locked->active.lock);
> +	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
> +		spin_unlock(&locked->active.lock);
> +		spin_lock(&engine->active.lock);
> +		locked = engine;
> +	}
> +	list_del(&rq->sched.link);
> +	spin_unlock(&locked->active.lock);
> +}
> +
>   static bool i915_request_retire(struct i915_request *rq)
>   {
>   	struct i915_active_request *active, *next;
> @@ -260,9 +281,7 @@ static bool i915_request_retire(struct i915_request *rq)
>   	 * request that we have removed from the HW and put back on a run
>   	 * queue.
>   	 */
> -	spin_lock(&rq->engine->active.lock);
> -	list_del(&rq->sched.link);
> -	spin_unlock(&rq->engine->active.lock);
> +	remove_from_engine(rq);
>   
>   	spin_lock(&rq->lock);
>   	i915_request_mark_complete(rq);
> 

How exactly can we retire and resubmit at the same time? Isn't retirement 
a one-off event, once the seqno is complete and the rq is signaled?

Regards,

Tvrtko
