[PATCH v2 2/2] drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout

Wed Oct 16 18:51:58 UTC 2024

On 10/16/2024 04:52, Badal Nilawar wrote:
> In case if g2h worker doesn't get opportunity to within specified
> timeout delay then flush the g2h worker explicitly.
>
> v2:
>    - Describe change in comment and add TODO (Matt B/John H)
>    - Add xe_gt_warn on fence done after G2H flush (John H)
>
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902
> Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: John Harrison <John.C.Harrison at Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_ct.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 3096baa4c9f4..c4e06d6722f0 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -1028,6 +1028,21 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>   
>   	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ * 3);
>   
> +	/*
> +	 * Explicitly flush g2h_worker if it hasn’t had the chance to run after being queued due
> +	 * to delays in workqueue scheduling.
> +	 *
> +	 * TODO: Drop this change once workqueue scheduling delay issue root caused and fixed.
> +	 */
> +	if (!ret) {
> +		flush_work(&ct->g2h_worker);
> +		if (g2h_fence.done) {
> +			xe_gt_warn(gt, "G2H fence %u, action %04x, done %s after G2H flush\n",
> +				   g2h_fence.seqno, action[0], str_yes_no(g2h_fence.done));
> +			ret = 1;
> +		}
> +	}
Why bump the timeout and then do the flush? If the only issue is 
believed to be the delayed worker thread then waiting longer before 
doing the flush seems counter productive. You are just increasing the 
time taken for no benefit. Flushing the worker thread should be all that 
is required. If anything, we should be breaking the timeout up into 
smaller chunks with a flush in each so that the completion happens 
sooner not later.

Also, there was a big discussion about resets on the previous revision 
of the patch set. What happened with that? I'm not seeing anything about 
connecting with the reset paths here?

John.

> +
>   	/*
>   	 * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
>   	 * the stack, since we have no clue if it will fire after the timeout before we can erase