[PATCH 1/3] drm/xe/guc/ct: Improve g2h request handling during async gt reset

Mon Oct 14 12:10:16 UTC 2024

Hi Matt,

Thanks for the review comments.

On 11-10-2024 04:31, Matthew Brost wrote:
> On Wed, Oct 09, 2024 at 04:26:43PM +0530, Badal Nilawar wrote:
>> It is possible that a g2h request may be cancelled while waiting for a
>> response due to an asynchronous gt reset. This commit ensures that in
>> such cases, caller will be notified by returning -ECANCELED.
>>
>> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
>> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Matthew Auld <matthew.auld at intel.com>
>> Cc: John Harrison <John.C.Harrison at Intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc_ct.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index c7673f56d413..b93b2821e4e8 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -512,6 +512,9 @@ void xe_guc_ct_stop(struct xe_guc_ct *ct)
>>   {
>>   	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_STOPPED);
>>   	stop_g2h_handler(ct);
>> +
>> +	/* Notify callers that CT stopped and G2H requests are cancelled */
>> +	wake_up_all(&ct->g2h_fence_wq);
>>   }
>>   
>>   static bool h2g_has_room(struct xe_guc_ct *ct, u32 cmd_len)
>> @@ -1018,6 +1021,19 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>   
>>   	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> 
> Better would be to abort the wait here if a GT reset is queued or in
> progress. We do this a lot in the xe_guc_submit.c - see any of the
> wait_event functions in that file. We likely should normalize this a bit
> with proper layering but basically the flow should be:
> 
> - Any wait_event_* are OR'd with a queued or in progress GT reset

In xe_guc_submit.c, to check whether a reset is queued or in progress, we
check whether GuC submission is stopped via xe_guc_read_stopped(). Are you
suggesting to use xe_guc_read_stopped() instead of checking ct->state?

Or should we do it like this?

	ret = wait_event_timeout(ct->g2h_fence_wq,
				 g2h_fence.done || ct->state == XE_GUC_CT_STATE_STOPPED,
				 HZ);
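
To make the second option concrete, here is a rough userspace analogue using
pthreads. It is not kernel code and does not use the real xe structures:
fake_ct, fake_fence, fake_ct_stop(), fake_wait_g2h() and demo_cancelled_wait()
are made-up names for illustration only. It sketches the wait predicate OR'd
with the stopped state, the stop path waking all waiters, and the check after
wakeup that turns a stop into -ECANCELED:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the CT; not the xe structures. */
struct fake_ct {
	pthread_mutex_t lock;
	pthread_cond_t g2h_fence_wq;
	bool stopped;	/* stands in for ct->state == XE_GUC_CT_STATE_STOPPED */
};

/* Hypothetical stand-in for the on-stack g2h fence. */
struct fake_fence {
	bool done;
};

/* Analogue of the stop path: mark the CT stopped, then wake every waiter. */
static void fake_ct_stop(struct fake_ct *ct)
{
	pthread_mutex_lock(&ct->lock);
	ct->stopped = true;
	pthread_cond_broadcast(&ct->g2h_fence_wq);
	pthread_mutex_unlock(&ct->lock);
}

/* Wait for "done || stopped" with a timeout, then check which one fired. */
static int fake_wait_g2h(struct fake_ct *ct, struct fake_fence *fence,
			 int timeout_sec)
{
	struct timespec deadline;
	int err = 0;
	int ret;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_sec;

	pthread_mutex_lock(&ct->lock);
	while (!fence->done && !ct->stopped && err != ETIMEDOUT)
		err = pthread_cond_timedwait(&ct->g2h_fence_wq, &ct->lock,
					     &deadline);

	if (fence->done)
		ret = 0;		/* response arrived */
	else if (ct->stopped)
		ret = -ECANCELED;	/* request cancelled by the stop path */
	else
		ret = -ETIMEDOUT;	/* neither happened before the deadline */
	pthread_mutex_unlock(&ct->lock);

	return ret;
}

/* Plays the role of an asynchronous reset that stops the CT. */
static void *reset_thread(void *arg)
{
	fake_ct_stop(arg);
	return NULL;
}

/* Start a waiter and an async stop; the fence is never completed. */
static int demo_cancelled_wait(void)
{
	struct fake_ct ct = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.g2h_fence_wq = PTHREAD_COND_INITIALIZER,
		.stopped = false,
	};
	struct fake_fence fence = { .done = false };
	pthread_t t;
	int rc, ret;

	rc = pthread_create(&t, NULL, reset_thread, &ct);
	assert(rc == 0);
	(void)rc;

	ret = fake_wait_g2h(&ct, &fence, 1);
	pthread_join(t, NULL);

	return ret;
}
```

Since nothing ever sets done here, demo_cancelled_wait() is expected to return
-ECANCELED whether the stop lands before or after the waiter goes to sleep.
The real change would presumably still need the existing serialization with
the completion side that follows the wait in guc_ct_send_recv().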

> 
> - After wait_event_* signals check for OR condition, handle gracefully
>    via an error code kicking it to upper layers

Agree.

> 
> - All upper layers need to cope with H2G failing or use *_no_fail
>    versions of the H2G functions. The *_no_fail versions are untested as I
>    coded those 2.5 years ago in Xe and don't have a user of those functions

Ok.

> 
> - Queuing a GT reset wakes up all waiters

How should we do this? After queuing a GT reset, or during the GT reset
itself, CT communication will still be in use. In particular, during GT start
we call guc_pc_start, where xe_guc_send_recv is used for the SLPC check.

> 
> - Upon completion of GT reset the OR condition is cleared

OK. The condition will be cleared once CT is enabled.

Regards,
Badal

> 
> Matt
> 
>>   
>> +	/*
>> +	 * It is possible that the g2h request may be cancelled while waiting for a response due
>> +	 * to an asynchronous gt reset. In such cases, return -ECANCELED.
>> +	 */
>> +	mutex_lock(&ct->lock);
>> +	if (ct->state == XE_GUC_CT_STATE_STOPPED) {
>> +		xe_gt_dbg(gt, "H2G action %#x canceled as GT reset is in progress\n",
>> +			  action[0]);
>> +		mutex_unlock(&ct->lock);
>> +		return -ECANCELED;
>> +	}
>> +	mutex_unlock(&ct->lock);
>> +
>>   	/*
>>   	 * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
>>   	 * the stack, since we have no clue if it will fire after the timeout before we can erase
>> -- 
>> 2.34.1
>>