[Intel-xe] [PATCH v2] drm/xe: fix mem_access for early lrc generation
Matthew Brost
matthew.brost at intel.com
Tue Nov 28 09:27:05 UTC 2023
On Mon, Nov 27, 2023 at 09:44:59AM +0000, Matthew Auld wrote:
> We spawn some hw queues during device probe to generate the default LRC
> for every engine type; however, the queue destruction step is typically
> async. Queue destruction needs to do things like GuC context
> deregistration, which requires the GuC CT, which in turn requires an
> active mem_access ref. The caller during probe is meant to hold the
> mem_access token; however, due to the async destruction it might already
> have been dropped by the time destruction runs, if we are unlucky.
>
> Similar to how we already handle migrate VMs, for which there is no
> mem_access ref, fix this by keeping the caller's token alive, releasing
> it only when destroying the queue. We can treat a NULL vm as an
> indication that we need to grab our own extra ref.
>
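
To spell out the lifetime issue being fixed, the probe-time flow is roughly
the following (xe_device_mem_access_get()/put() and xe_exec_queue_put() are
the real interfaces; the queue-creation helper name here is only a
placeholder for this sketch):

	xe_device_mem_access_get(xe);
	/* kernel-internal queue with no vm, used to record the default LRC */
	q = create_default_lrc_queue(hwe);	/* placeholder name */
	xe_exec_queue_put(q);			/* teardown completes async via GuC deregister over CT */
	xe_device_mem_access_put(xe);		/* caller's ref can drop before the async fini runs */

With this patch the queue grabs its own mem_access ref at creation whenever
there is no user vm (or it uses the kernel's migrate vm) and only drops it
in xe_exec_queue_fini(), so the async teardown no longer depends on the
caller's ref still being held.
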
> Fixes the following splat sometimes seen during load:
>
> [ 1682.899930] WARNING: CPU: 1 PID: 8642 at drivers/gpu/drm/xe/xe_device.c:537 xe_device_assert_mem_access+0x27/0x30 [xe]
> [ 1682.900209] CPU: 1 PID: 8642 Comm: kworker/u24:97 Tainted: G U W E N 6.6.0-rc3+ #6
> [ 1682.900214] Workqueue: submit_wq xe_sched_process_msg_work [xe]
> [ 1682.900303] RIP: 0010:xe_device_assert_mem_access+0x27/0x30 [xe]
> [ 1682.900388] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 1e 6c 03 00 48 85 c0 74 06 5b c3 cc cc cc cc 8b 83 28 23 00 00 85 c0 75 f0 <0f> 0b 5b c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [ 1682.900390] RSP: 0018:ffffc900021cfb68 EFLAGS: 00010246
> [ 1682.900394] RAX: 0000000000000000 RBX: ffff8886a96d8000 RCX: 0000000000000000
> [ 1682.900396] RDX: 0000000000000001 RSI: ffff8886a6311a00 RDI: ffff8886a96d8000
> [ 1682.900398] RBP: ffffc900021cfcc0 R08: 0000000000000001 R09: 0000000000000000
> [ 1682.900400] R10: ffffc900021cfcd0 R11: 0000000000000002 R12: 0000000000000004
> [ 1682.900402] R13: 0000000000000000 R14: ffff8886a6311990 R15: ffffc900021cfd74
> [ 1682.900405] FS: 0000000000000000(0000) GS:ffff888829880000(0000) knlGS:0000000000000000
> [ 1682.900407] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1682.900409] CR2: 000055f70bad3fb0 CR3: 000000025243a004 CR4: 00000000003706e0
> [ 1682.900412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1682.900413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1682.900415] Call Trace:
> [ 1682.900418] <TASK>
> [ 1682.900420] ? xe_device_assert_mem_access+0x27/0x30 [xe]
> [ 1682.900504] ? __warn+0x85/0x170
> [ 1682.900510] ? xe_device_assert_mem_access+0x27/0x30 [xe]
> [ 1682.900596] ? report_bug+0x171/0x1a0
> [ 1682.900604] ? handle_bug+0x3c/0x80
> [ 1682.900608] ? exc_invalid_op+0x17/0x70
> [ 1682.900612] ? asm_exc_invalid_op+0x1a/0x20
> [ 1682.900621] ? xe_device_assert_mem_access+0x27/0x30 [xe]
> [ 1682.900706] ? xe_device_assert_mem_access+0x12/0x30 [xe]
> [ 1682.900790] guc_ct_send_locked+0xb9/0x1550 [xe]
> [ 1682.900882] ? lock_acquire+0xca/0x2b0
> [ 1682.900885] ? guc_ct_send+0x3c/0x1a0 [xe]
> [ 1682.900977] ? lock_is_held_type+0x9b/0x110
> [ 1682.900984] ? __mutex_lock+0xc0/0xb90
> [ 1682.900989] ? __pfx___drm_printfn_info+0x10/0x10
> [ 1682.900999] guc_ct_send+0x53/0x1a0 [xe]
> [ 1682.901090] ? __lock_acquire+0xf22/0x21b0
> [ 1682.901097] ? process_one_work+0x1a0/0x500
> [ 1682.901109] xe_guc_ct_send+0x19/0x50 [xe]
> [ 1682.901202] set_min_preemption_timeout+0x75/0xa0 [xe]
> [ 1682.901294] disable_scheduling_deregister+0x55/0x250 [xe]
> [ 1682.901383] ? xe_sched_process_msg_work+0x76/0xd0 [xe]
> [ 1682.901467] ? lock_release+0xc9/0x260
> [ 1682.901474] xe_sched_process_msg_work+0x82/0xd0 [xe]
> [ 1682.901559] process_one_work+0x20a/0x500
>
> v2: Add the splat
>
> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> Cc: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> ---
> drivers/gpu/drm/xe/xe_exec_queue.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 62d0237e724e..7a9186cea5a6 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -90,12 +90,12 @@ static struct xe_exec_queue *__xe_exec_queue_create(struct xe_device *xe,
> /*
> * Normally the user vm holds an rpm ref to keep the device
> * awake, and the context holds a ref for the vm, however for
> - * some engines we use the kernels migrate vm underneath which
> - * offers no such rpm ref. Make sure we keep a ref here, so we
> - * can perform GuC CT actions when needed. Caller is expected to
> - * have already grabbed the rpm ref outside any sensitive locks.
> + * some engines we use the kernels migrate vm underneath which offers no
> + * such rpm ref, or we lack a vm. Make sure we keep a ref here, so we
> + * can perform GuC CT actions when needed. Caller is expected to have
> + * already grabbed the rpm ref outside any sensitive locks.
> */
> - if (q->flags & EXEC_QUEUE_FLAG_VM)
> + if (q->flags & EXEC_QUEUE_FLAG_VM || !vm)
> drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
>
> return q;
> @@ -172,10 +172,10 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
>
> for (i = 0; i < q->width; ++i)
> xe_lrc_finish(q->lrc + i);
> + if (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm)
> + xe_device_mem_access_put(gt_to_xe(q->gt));
> if (q->vm)
> xe_vm_put(q->vm);
> - if (q->flags & EXEC_QUEUE_FLAG_VM)
> - xe_device_mem_access_put(gt_to_xe(q->gt));
>
> kfree(q);
> }
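
For reference, the tail of xe_exec_queue_fini() as it reads with this change
applied (reconstructed from the hunk above), showing the extra mem_access ref
being dropped for both the FLAG_VM and the no-vm cases:

	for (i = 0; i < q->width; ++i)
		xe_lrc_finish(q->lrc + i);
	if (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm)
		xe_device_mem_access_put(gt_to_xe(q->gt));
	if (q->vm)
		xe_vm_put(q->vm);

	kfree(q);
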
> --
> 2.43.0
>