[Intel-xe] [PATCH v2] drm/xe: fix mem_access for early lrc generation

Matthew Auld matthew.auld at intel.com
Wed Nov 29 10:39:29 UTC 2023


On 28/11/2023 09:27, Matthew Brost wrote:
> On Mon, Nov 27, 2023 at 09:44:59AM +0000, Matthew Auld wrote:
>> We spawn some hw queues during device probe to generate the default LRC
>> for every engine type; however, the queue destruction step is typically
>> async. Queue destruction needs to do things like GuC context deregister,
>> which requires GuC CT, which in turn requires an active mem_access ref.
>> The caller during probe is meant to hold the mem_access token, but due
>> to the async destruction it might have already been dropped by the time
>> we need it, if we are unlucky.
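>>
>> For illustration, the racy probe-time lifetime looks roughly like the
>> sketch below (simplified; the default-LRC plumbing in between is
>> paraphrased, only the mem_access and exec queue helpers are the real
>> ones):
>>
>>   xe_device_mem_access_get(xe);
>>   q = xe_exec_queue_create(xe, NULL, ...);  /* no vm for these queues */
>>   /* ... run the queue and record the default LRC ... */
>>   xe_exec_queue_put(q);          /* final teardown (GuC deregister) is async */
>>   xe_device_mem_access_put(xe);  /* token dropped; the async teardown may
>>                                   * still need GuC CT -> splat below */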
>>
>> Similar to how we already handle migrate VMs, for which there is no
>> mem_access ref, fix this by keeping the caller's token alive and
>> releasing it only when destroying the queue. We can treat a NULL vm as
>> an indication that we need to grab our own extra ref.
>>
>> Fixes the following splat, sometimes seen during load:
>>
>> [ 1682.899930] WARNING: CPU: 1 PID: 8642 at drivers/gpu/drm/xe/xe_device.c:537 xe_device_assert_mem_access+0x27/0x30 [xe]
>> [ 1682.900209] CPU: 1 PID: 8642 Comm: kworker/u24:97 Tainted: G     U  W   E    N 6.6.0-rc3+ #6
>> [ 1682.900214] Workqueue: submit_wq xe_sched_process_msg_work [xe]
>> [ 1682.900303] RIP: 0010:xe_device_assert_mem_access+0x27/0x30 [xe]
>> [ 1682.900388] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 1e 6c 03 00 48 85 c0 74 06 5b c3 cc cc cc cc 8b 83 28 23 00 00 85 c0 75 f0 <0f> 0b 5b c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90
>> [ 1682.900390] RSP: 0018:ffffc900021cfb68 EFLAGS: 00010246
>> [ 1682.900394] RAX: 0000000000000000 RBX: ffff8886a96d8000 RCX: 0000000000000000
>> [ 1682.900396] RDX: 0000000000000001 RSI: ffff8886a6311a00 RDI: ffff8886a96d8000
>> [ 1682.900398] RBP: ffffc900021cfcc0 R08: 0000000000000001 R09: 0000000000000000
>> [ 1682.900400] R10: ffffc900021cfcd0 R11: 0000000000000002 R12: 0000000000000004
>> [ 1682.900402] R13: 0000000000000000 R14: ffff8886a6311990 R15: ffffc900021cfd74
>> [ 1682.900405] FS:  0000000000000000(0000) GS:ffff888829880000(0000) knlGS:0000000000000000
>> [ 1682.900407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1682.900409] CR2: 000055f70bad3fb0 CR3: 000000025243a004 CR4: 00000000003706e0
>> [ 1682.900412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1682.900413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1682.900415] Call Trace:
>> [ 1682.900418]  <TASK>
>> [ 1682.900420]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
>> [ 1682.900504]  ? __warn+0x85/0x170
>> [ 1682.900510]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
>> [ 1682.900596]  ? report_bug+0x171/0x1a0
>> [ 1682.900604]  ? handle_bug+0x3c/0x80
>> [ 1682.900608]  ? exc_invalid_op+0x17/0x70
>> [ 1682.900612]  ? asm_exc_invalid_op+0x1a/0x20
>> [ 1682.900621]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
>> [ 1682.900706]  ? xe_device_assert_mem_access+0x12/0x30 [xe]
>> [ 1682.900790]  guc_ct_send_locked+0xb9/0x1550 [xe]
>> [ 1682.900882]  ? lock_acquire+0xca/0x2b0
>> [ 1682.900885]  ? guc_ct_send+0x3c/0x1a0 [xe]
>> [ 1682.900977]  ? lock_is_held_type+0x9b/0x110
>> [ 1682.900984]  ? __mutex_lock+0xc0/0xb90
>> [ 1682.900989]  ? __pfx___drm_printfn_info+0x10/0x10
>> [ 1682.900999]  guc_ct_send+0x53/0x1a0 [xe]
>> [ 1682.901090]  ? __lock_acquire+0xf22/0x21b0
>> [ 1682.901097]  ? process_one_work+0x1a0/0x500
>> [ 1682.901109]  xe_guc_ct_send+0x19/0x50 [xe]
>> [ 1682.901202]  set_min_preemption_timeout+0x75/0xa0 [xe]
>> [ 1682.901294]  disable_scheduling_deregister+0x55/0x250 [xe]
>> [ 1682.901383]  ? xe_sched_process_msg_work+0x76/0xd0 [xe]
>> [ 1682.901467]  ? lock_release+0xc9/0x260
>> [ 1682.901474]  xe_sched_process_msg_work+0x82/0xd0 [xe]
>> [ 1682.901559]  process_one_work+0x20a/0x500
>>
>> v2: Add the splat
>>
>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>> Cc: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
> 
> Reviewed-by: Matthew Brost <matthew.brost at intel.com>

Thanks.

> 
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_exec_queue.c | 14 +++++++-------
>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>> index 62d0237e724e..7a9186cea5a6 100644
>> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>> @@ -90,12 +90,12 @@ static struct xe_exec_queue *__xe_exec_queue_create(struct xe_device *xe,
>>   	/*
>>   	 * Normally the user vm holds an rpm ref to keep the device
>>   	 * awake, and the context holds a ref for the vm, however for
>> -	 * some engines we use the kernels migrate vm underneath which
>> -	 * offers no such rpm ref. Make sure we keep a ref here, so we
>> -	 * can perform GuC CT actions when needed. Caller is expected to
>> -	 * have already grabbed the rpm ref outside any sensitive locks.
>> +	 * some engines we use the kernel's migrate vm underneath which offers no
>> +	 * such rpm ref, or we lack a vm. Make sure we keep a ref here, so we
>> +	 * can perform GuC CT actions when needed. Caller is expected to have
>> +	 * already grabbed the rpm ref outside any sensitive locks.
>>   	 */
>> -	if (q->flags & EXEC_QUEUE_FLAG_VM)
>> +	if (q->flags & EXEC_QUEUE_FLAG_VM || !vm)
>>   		drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
>>   
>>   	return q;
>> @@ -172,10 +172,10 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
>>   
>>   	for (i = 0; i < q->width; ++i)
>>   		xe_lrc_finish(q->lrc + i);
>> +	if (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm)
>> +		xe_device_mem_access_put(gt_to_xe(q->gt));
>>   	if (q->vm)
>>   		xe_vm_put(q->vm);
>> -	if (q->flags & EXEC_QUEUE_FLAG_VM)
>> -		xe_device_mem_access_put(gt_to_xe(q->gt));
>>   
>>   	kfree(q);
>>   }
>> -- 
>> 2.43.0
>>

