[PATCH 2/2] drm/xe: Add lock around devcoredump capture status
Matthew Brost
matthew.brost at intel.com
Thu Nov 21 03:30:04 UTC 2024
On Wed, Nov 20, 2024 at 03:22:57PM -0800, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> There are now multiple places that can trigger a coredump. Some of
> which can happen in parallel. There is already a check against
> capturing multiple dumps sequentially, but without locking it doesn't
> guarantee to work against concurrent dumps. And if two dumps do happen
> in parallel, they can end up doing Bad Things such as one call stack
> freeing the data the other call stack is still processing. Which leads
> to a crashed kernel. So add the lock.
>
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> ---
> drivers/gpu/drm/xe/xe_devcoredump.c | 18 ++++++++++++++----
> drivers/gpu/drm/xe/xe_devcoredump_types.h | 2 ++
> 2 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> index b5aebe8fb06d..05f3c9185ef3 100644
> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> @@ -235,9 +235,13 @@ static void xe_devcoredump_free(void *data)
>
> /* To prevent stale data on next snapshot, clear everything */
> memset(&coredump->snapshot, 0, sizeof(coredump->snapshot));
> - coredump->captured = false;
> +
> drm_info(&coredump_to_xe(coredump)->drm,
> "Xe device coredump has been deleted.\n");
> +
> + spin_lock_irq(&coredump->lock);
> + coredump->captured = false;
> + spin_unlock_irq(&coredump->lock);
I don't think you need the IRQ version here, or in the other usage. I
don't believe we are in IRQ context in either case. And even if we
were, only one of the callers would use spin_lock (already in IRQ
context) while the other would use spin_lock_irq (not in IRQ context,
but racing against one that can be). Taking spin_lock_irq in every use
of a lock certainly seems unnecessary.
Are there follow-on patches where this changes?
Adding the lock seems right though, and should enable coredumps on
multi-GT devices or from contexts which are not on the GT ordered WQ.
Maybe also add helpers like...

void xe_devcoredump_release(struct xe_devcoredump *coredump)
{
	spin_lock(&coredump->lock);
	coredump->captured = false;
	spin_unlock(&coredump->lock);
}

bool xe_devcoredump_acquire(struct xe_devcoredump *coredump)
{
	bool acquired;

	spin_lock(&coredump->lock);
	acquired = !coredump->captured;
	coredump->captured = true;
	spin_unlock(&coredump->lock);

	return acquired;
}

IMO this makes the locking semantics a bit clearer.
> }
>
> static void devcoredump_snapshot(struct xe_devcoredump *coredump,
> @@ -310,14 +314,18 @@ void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const cha
> struct xe_device *xe = gt_to_xe(q->gt);
> struct xe_devcoredump *coredump = &xe->devcoredump;
> va_list varg;
> + bool captured;
>
> - if (coredump->captured) {
> + spin_lock_irq(&coredump->lock);
> + captured = coredump->captured;
> + coredump->captured = true;
> + spin_unlock_irq(&coredump->lock);
> +
> + if (captured) {
Then here...

	if (!xe_devcoredump_acquire(coredump)) {
		msg...
		return;
	}
> drm_dbg(&xe->drm, "Multiple hangs are occurring, but only the first snapshot was taken\n");
> return;
> }
>
> - coredump->captured = true;
> -
> va_start(varg, fmt);
> coredump->snapshot.reason = kvasprintf(GFP_KERNEL, fmt, varg);
> va_end(varg);
> @@ -342,6 +350,8 @@ static void xe_driver_devcoredump_fini(void *arg)
>
> int xe_devcoredump_init(struct xe_device *xe)
> {
> + spin_lock_init(&xe->devcoredump.lock);
> +
> return devm_add_action_or_reset(xe->drm.dev, xe_driver_devcoredump_fini, &xe->drm);
> }
>
> diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> index e6234e887102..6b776abd8bf4 100644
> --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> @@ -80,6 +80,8 @@ struct xe_devcoredump_snapshot {
> * for reading the information.
> */
> struct xe_devcoredump {
> + /** @lock: protects access to 'captured' to prevent concurrent captures */
> + spinlock_t lock;
> /** @captured: The snapshot of the first hang has already been taken. */
Protected by @lock.
Matt
> bool captured;
> /** @snapshot: Snapshot is captured at time of the first crash */
> --
> 2.47.0
>