[Intel-xe] [PATCH] drm/xe/guc_submit: prevent repeated unregister

Matthew Brost matthew.brost at intel.com
Wed Aug 2 15:56:33 UTC 2023


On Wed, Aug 02, 2023 at 03:03:16PM +0100, Matthew Auld wrote:
> It seems that various things can trigger the lr cleanup worker,
> including CAT error, engine reset and destroying the actual engine, so
> seems plausible to end up triggering the worker more than once in some
> cases. If that does happen we can race with an ongoing engine deregister
> before it has completed, thus triggering it again and also changing the
> state back into pending_disable. Checking if the engine has been marked
> as destroyed looks like it should prevent this.
> 
> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 8d2b50e03ed2..2459e9dd95e8 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -802,8 +802,18 @@ static void xe_guc_engine_lr_cleanup(struct work_struct *w)
>  	/* Kill the run_job / process_msg entry points */
>  	drm_sched_run_wq_stop(sched);
>  
> -	/* Engine state now stable, disable scheduling / deregister if needed */
> -	if (engine_registered(e)) {
> +	/*
> +	 * Engine state now mostly stable, disable scheduling / deregister if
> +	 * needed. This cleanup routine might be called multiple times, where
> +	 * the actual async engine deregister drops the final engine ref.
> +	 * Calling disable_scheduling_deregister will mark the engine as
> +	 * destroyed and fire off the CT requests to disable scheduling /
> +	 * deregister, which we only want to do once. We also don't want to mark
> +	 * the engine as pending_disable again as this may race with the
> +	 * xe_guc_deregister_done_handler() which treats it as an unexpected
> +	 * state.
> +	 */
> +	if (engine_registered(e) && !engine_destroyed(e)) {

This looks correct but a few things.

This will need to be rebased after the s/engine/exec_queue patch that has landed.

guc_exec_queue_timedout_job I think has the same issue too. Can you post a patch for that too.

Anyways this one LGTM:

Reviewed-by: Matthew Brost <matthew.brost at intel.com>

>  		struct xe_guc *guc = engine_to_guc(e);
>  		int ret;
>  
> -- 
> 2.41.0
> 


More information about the Intel-xe mailing list