[Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context

Thu Sep 24 13:34:14 UTC 2020

On 16/09/2020 10:42, Chris Wilson wrote:
> We have to be very careful while walking the timeline->requests list
> under the RCU guard, as the requests (and so rq->link) use
> SLAB_TYPESAFE_BY_RCU and so the requests may be reallocated within an
> rcu grace period. As the requests are reallocated, they are removed from
> one list and placed on another, and if we are iterating over that
> request at that moment, the list iteration jumps from one list to the
> next and promptly gets confused. Verify we hold the request reference
> to ensure that the request is not added to a new list behind our backs.
> 
> <4> [582.745252] general protection fault, probably for non-canonical address 0xcccccccccccccd5c: 0000 [#1] PREEMPT SMP PTI
> <4> [582.745297] CPU: 0 PID: 1475 Comm: gem_ctx_persist Not tainted 5.9.0-rc1-CI-CI_DRM_8908+ #1
> <4> [582.745304] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
> <4> [582.745317] RIP: 0010:__lock_acquire+0x2c3/0x1f40
> <4> [582.745323] Code: 00 65 8b 05 c7 8a ef 7e 85 c0 0f 85 b4 07 00 00 44 8b 9d c4 08 00 00 45 85 db 0f 84 0f 01 00 00 ba 05 00 00 00 e9 c8 06 00 00 <48> 81 3f c0 89 c7 82 b8 00 00 00 00 41 0f 45 c0 83 fe 01 41 89 c3
> <4> [582.745334] RSP: 0018:ffffc9000461bc40 EFLAGS: 00010002
> <4> [582.745340] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> <4> [582.745345] RDX: 0000000000000000 RSI: 0000000000000000 RDI: cccccccccccccd5c
> <4> [582.745350] RBP: ffff8881ec4a2880 R08: 0000000000000001 R09: 0000000000000001
> <4> [582.745356] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> <4> [582.745361] R13: 0000000000000000 R14: 0000000000000000 R15: cccccccccccccd5c
> <4> [582.745367] FS:  00007fb44da78e40(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
> <4> [582.745373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [582.745378] CR2: 00007fb44daad040 CR3: 0000000268428000 CR4: 0000000000350ef0
> <4> [582.745383] Call Trace:
> <4> [582.745390]  ? __lock_acquire+0x913/0x1f40
> <4> [582.745397]  lock_acquire+0xb5/0x3c0
> <4> [582.745526]  ? kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745533]  ? find_held_lock+0x2d/0x90
> <4> [582.745541]  _raw_spin_lock_irq+0x30/0x40
> <4> [582.745635]  ? kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745727]  kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745820]  context_close+0x195/0x410 [i915]
> <4> [582.745912]  i915_gem_context_close+0x5b/0x160 [i915]
> <4> [582.745994]  i915_driver_postclose+0x14/0x40 [i915]
> <4> [582.746003]  drm_file_free.part.13+0x240/0x290
> <4> [582.746009]  drm_release_noglobal+0x16/0x50
> <4> [582.746016]  __fput+0xa5/0x250
> <4> [582.746021]  task_work_run+0x6e/0xb0
> <4> [582.746028]  exit_to_user_mode_prepare+0x178/0x180
> <4> [582.746034]  syscall_exit_to_user_mode+0x36/0x220
> <4> [582.746040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> <4> [582.746045] RIP: 0033:0x7fb44d1dc421
> <4> [582.746050] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 ea cf 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
> <4> [582.746062] RSP: 002b:00007ffed2e83818 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> <4> [582.746069] RAX: 0000000000000000 RBX: 0000556410bfe840 RCX: 00007fb44d1dc421
> <4> [582.746075] RDX: 000000000000000a RSI: 00000000c0406469 RDI: 0000000000000008
> <4> [582.746080] RBP: 0000000000000008 R08: 00007fb44d1c51cc R09: 00007fb44d1c5240
> <4> [582.746086] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000fffffffb
> <4> [582.746091] R13: 0000000000000006 R14: 0000000000000000 R15: 000000000000000a
> <4> [582.746099] Modules linked in: vgem mei_hdcp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio btusb btrtl btbcm btintel x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul bluetooth ghash_clmulni_intel ecdh_generic ecc i915 r8169 realtek mei_me mei snd_hda_intel i2c_hid snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_geminilake pinctrl_intel prime_numbers [last unloaded: test_drm_mm]
> 
> Fixes: 09a3054f38db ("drm/i915/gem: Reduce context termination list iteration guard to RCU")
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 25 ++++++++++++++++-----
>   1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index cf5ecbde9e06..a548626fa8bc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -460,8 +460,8 @@ __active_engine(struct i915_request *rq, struct intel_engine_cs **active)
>   		spin_lock(&locked->active.lock);
>   	}
>   
> -	if (!i915_request_completed(rq)) {
> -		if (i915_request_is_active(rq) && rq->fence.error != -EIO)
> +	if (i915_request_is_active(rq)) {
> +		if (!i915_request_completed(rq))
>   			*active = locked;
>   		ret = true;
>   	}
> @@ -479,13 +479,26 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
>   	if (!ce->timeline)
>   		return NULL;
>   
> +	/*
> +	 * rq->link is only SLAB_TYPESAFE_BY_RCU, we need to hold a reference
> +	 * to the request to prevent it being transferred to a new timeline
> +	 * (and onto a new timeline->requests list).
> +	 */
>   	rcu_read_lock();
> -	list_for_each_entry_rcu(rq, &ce->timeline->requests, link) {
> -		if (i915_request_is_active(rq) && i915_request_completed(rq))
> -			continue;
> +	list_for_each_entry_reverse(rq, &ce->timeline->requests, link) {
> +		bool found;
> +
> +		/* timeline is already completed upto this point? */
> +		if (!i915_request_get_rcu(rq))
> +			break;
>   
>   		/* Check with the backend if the request is inflight */
> -		if (__active_engine(rq, &engine))
> +		found = true;
> +		if (likely(rcu_access_pointer(rq->timeline) == ce->timeline))

So the timeline check is equivalent to request_completed in a way which 
I think holds and now I see comment above documents it.

> +			found = __active_engine(rq, &engine);
> +
> +		i915_request_put(rq);
> +		if (found)
>   			break;
>   	}
>   	rcu_read_unlock();
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko