[Intel-xe] [PATCH v2 3/3] drm/xe: Avoid any races around ccs_mode update
Simona Vetter
simona.vetter at ffwll.ch
Wed Sep 4 10:28:47 UTC 2024
On Tue, Nov 28, 2023 at 06:57:16PM -0800, Niranjana Vishwanathapura wrote:
> Ensure that there are no drm clients when changing CCS mode.
> Allow exec_queue creation only with enabled CCS engines.
>
> v2: Rebase
>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.c | 9 +++++++++
> drivers/gpu/drm/xe/xe_device_types.h | 9 +++++++++
> drivers/gpu/drm/xe/xe_gt_ccs_mode.c | 10 ++++++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 54202623e255..c7134d377b4c 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -72,6 +72,10 @@ static int xe_file_open(struct drm_device *dev, struct drm_file *file)
> mutex_init(&xef->exec_queue.lock);
> xa_init_flags(&xef->exec_queue.xa, XA_FLAGS_ALLOC1);
>
> + spin_lock(&xe->clients.lock);
> + xe->clients.count++;
> + spin_unlock(&xe->clients.lock);
> +
> file->driver_priv = xef;
> return 0;
> }
> @@ -104,6 +108,10 @@ static void xe_file_close(struct drm_device *dev, struct drm_file *file)
> xa_destroy(&xef->vm.xa);
> mutex_destroy(&xef->vm.lock);
>
> + spin_lock(&xe->clients.lock);
> + xe->clients.count--;
> + spin_unlock(&xe->clients.lock);
> +
> xe_drm_client_put(xef->client);
> kfree(xef);
> }
> @@ -226,6 +234,7 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
> xe->info.force_execlist = xe_modparam.force_execlist;
>
> spin_lock_init(&xe->irq.lock);
> + spin_lock_init(&xe->clients.lock);
>
> init_waitqueue_head(&xe->ufence_wq);
>
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 2712905c7a91..12cb1caacda8 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -306,6 +306,15 @@ struct xe_device {
> enum xe_sriov_mode __mode;
> } sriov;
>
> + /** @clients: drm clients info */
> + struct {
> + /** @lock: Protects drm clients info */
> + spinlock_t lock;
> +
> + /** @count: number of drm clients */
> + u64 count;
> + } clients;
> +
> /** @usm: unified memory state */
> struct {
> /** @asid: convert a ASID to VM */
> diff --git a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
> index 0ad97383aedd..303ffb0761d8 100644
> --- a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
> +++ b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
> @@ -90,6 +90,7 @@ ccs_mode_store(struct device *kdev, struct device_attribute *attr,
> const char *buff, size_t count)
> {
> struct xe_gt *gt = kobj_to_gt(&kdev->kobj);
> + struct xe_device *xe = gt_to_xe(gt);
> u32 num_engines, num_slices;
> int ret;
>
> @@ -108,12 +109,21 @@ ccs_mode_store(struct device *kdev, struct device_attribute *attr,
> return -EINVAL;
> }
>
> + /* CCS mode can only be updated when there are no drm clients */
> + spin_lock(&xe->clients.lock);
> + if (xe->clients.count) {
> + spin_unlock(&xe->clients.lock);
> + return -EBUSY;
> + }
> +
> if (gt->ccs_mode.num_engines != num_engines) {
> xe_gt_info(gt, "Setting compute mode to %d\n", num_engines);
> gt->ccs_mode.num_engines = num_engines;
> xe_gt_reset_async(gt);
> }
>
> + spin_unlock(&xe->clients.lock);
I pinged Rodrigo on #xe irc, but I think it's better if I also drop this in
a mail reply. I think the patch doesn't yet reach its goal of closing all
the races here:
- xe_gt_reset_async() doesn't schedule a reset if there's already one
  pending, so I think you have a race here: you want a function that waits
  for any pending reset to finish, and _then_ issues a new one.
- The reset code does not have any protection around reading ->ccs_mode,
  so it might read different values at different times. That's no good. You
  need to make sure you read this exactly once, and then use that value
  everywhere. The simplest fix would be a gt->ccs_mode.num_engines_stage,
  which you WRITE_ONCE here, and then at the top of the reset code you do
  gt->ccs_mode.num_engines = READ_ONCE(gt->ccs_mode.num_engines_stage);
  see the first sketch below. But if you go with this you need to wait for
  the reset to finish while holding the lock, or there are new races.
- Because you use the ordered wq there's no need to flush anything that's
  already queued. I think it'd still be good to make sure that drm/sched
  absolutely does not reinsert anything; some other assert to make sure
  it's all clean would imo be an adequate level of paranoia.
- Also for paranoia I'd make this a mutex and just block for the reset to
  finish (second sketch below). It shouldn't cost anything, and it will
  stop any potential issues if the reset is still ongoing. That shouldn't
  happen due to the ordered wq, but this is so racy and the fallout so hard
  to catch that I think it's better to be absolutely safe.
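
Roughly what I mean with the staging field, as a completely untested
sketch; num_engines_stage is just a placeholder name I made up:

	/* in ccs_mode_store(), while holding the clients lock */
	if (READ_ONCE(gt->ccs_mode.num_engines_stage) != num_engines) {
		xe_gt_info(gt, "Setting compute mode to %d\n", num_engines);
		WRITE_ONCE(gt->ccs_mode.num_engines_stage, num_engines);
		xe_gt_reset_async(gt);
		/* ... and wait here for that reset to actually finish */
	}

	/* at the top of the gt reset worker, latch the value exactly once */
	gt->ccs_mode.num_engines =
		READ_ONCE(gt->ccs_mode.num_engines_stage);
	/* everything below only ever looks at gt->ccs_mode.num_engines */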
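
And the mutex idea, again untested and just a sketch: xe->clients.lock
becomes a struct mutex, and I'm assuming the reset worker is reachable as
gt->reset.worker so that flush_work() can block until it has run:

	mutex_lock(&xe->clients.lock);
	if (xe->clients.count) {
		mutex_unlock(&xe->clients.lock);
		return -EBUSY;
	}

	if (READ_ONCE(gt->ccs_mode.num_engines_stage) != num_engines) {
		xe_gt_info(gt, "Setting compute mode to %d\n", num_engines);
		WRITE_ONCE(gt->ccs_mode.num_engines_stage, num_engines);
		xe_gt_reset_async(gt);
		/* block until the queued reset has completed */
		flush_work(&gt->reset.worker);
	}
	mutex_unlock(&xe->clients.lock);

	return count;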
Cheers, Sima
> +
> return count;
> }
>
> --
> 2.21.0.rc0.32.g243a4c7e27
>
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch