[Intel-xe] [PATCH] drm/xe/guc_submit: prevent repeated unregister
Matthew Brost
matthew.brost at intel.com
Wed Aug 2 15:56:33 UTC 2023
On Wed, Aug 02, 2023 at 03:03:16PM +0100, Matthew Auld wrote:
> It seems that various things can trigger the lr cleanup worker,
> including CAT error, engine reset and destroying the actual engine, so
> seems plausible to end up triggering the worker more than once in some
> cases. If that does happen we can race with an ongoing engine deregister
> before it has completed, thus triggering it again and also changing the
> state back into pending_disable. Checking if the engine has been marked
> as destroyed looks like it should prevent this.
>
> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 8d2b50e03ed2..2459e9dd95e8 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -802,8 +802,18 @@ static void xe_guc_engine_lr_cleanup(struct work_struct *w)
> /* Kill the run_job / process_msg entry points */
> drm_sched_run_wq_stop(sched);
>
> - /* Engine state now stable, disable scheduling / deregister if needed */
> - if (engine_registered(e)) {
> + /*
> + * Engine state now mostly stable, disable scheduling / deregister if
> + * needed. This cleanup routine might be called multiple times, where
> + * the actual async engine deregister drops the final engine ref.
> + * Calling disable_scheduling_deregister will mark the engine as
> + * destroyed and fire off the CT requests to disable scheduling /
> + * deregister, which we only want to do once. We also don't want to mark
> + * the engine as pending_disable again as this may race with the
> + * xe_guc_deregister_done_handler() which treats it as an unexpected
> + * state.
> + */
> + if (engine_registered(e) && !engine_destroyed(e)) {
This looks correct but a few things.
This will need to be rebased after the s/engine/exec_queue patch that has landed.
guc_exec_queue_timedout_job I think has the same issue too. Can you post a patch for that too.
Anyways this one LGTM:
Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> struct xe_guc *guc = engine_to_guc(e);
> int ret;
>
> --
> 2.41.0
>
More information about the Intel-xe
mailing list