[PATCH 2/2] drm/xe: Add lock around devcoredump capture status
John Harrison
john.c.harrison at intel.com
Thu Nov 21 16:56:19 UTC 2024
On 11/20/2024 19:30, Matthew Brost wrote:
> On Wed, Nov 20, 2024 at 03:22:57PM -0800, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> There are now multiple places that can trigger a coredump, some of
>> which can happen in parallel. There is already a check against
>> capturing multiple dumps sequentially, but without locking it is not
>> guaranteed to work against concurrent dumps. And if two dumps do
>> happen in parallel, they can end up doing Bad Things, such as one
>> call stack freeing the data the other call stack is still processing,
>> which leads to a crashed kernel. So add the lock.
>>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_devcoredump.c | 18 ++++++++++++++----
>> drivers/gpu/drm/xe/xe_devcoredump_types.h | 2 ++
>> 2 files changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>> index b5aebe8fb06d..05f3c9185ef3 100644
>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>> @@ -235,9 +235,13 @@ static void xe_devcoredump_free(void *data)
>>
>> /* To prevent stale data on next snapshot, clear everything */
>> memset(&coredump->snapshot, 0, sizeof(coredump->snapshot));
>> - coredump->captured = false;
>> +
>> drm_info(&coredump_to_xe(coredump)->drm,
>> "Xe device coredump has been deleted.\n");
>> +
>> + spin_lock_irq(&coredump->lock);
>> + coredump->captured = false;
>> + spin_unlock_irq(&coredump->lock);
> I don't think you need the IRQ version here, or in the other usage. I
> don't believe either case here is in IRQ context. And even if one were,
> that case would use spin_lock (already in an IRQ context) and the other
> would use spin_lock_irq (not in an IRQ context itself, but racing with a
> case that can be). Certainly taking spin_lock_irq in every user of the
> lock seems unnecessary.
Yeah, that was a copy/paste that I was meaning to go back and update as
necessary but evidently forgot to.
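I.e. (just a sketch of the intended fixup, assuming, as you say, neither
path is actually in IRQ context) the free side would become:

	spin_lock(&coredump->lock);
	coredump->captured = false;
	spin_unlock(&coredump->lock);

and likewise for the test-and-set in xe_devcoredump() itself.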
>
> Is there some follow on patches where this changes?
There will be. Still more rework to be done for the GT-targeted capture
support.
>
> Adding the lock seems right though and should enable coredumps on
> multi-GT systems or from paths which are not on the GT ordered WQ.
Yeah, with the capture on GT reset I was hitting the race and got a very
dead kernel a couple of times (because there were parallel captures for
the same failure - one via GT reset and one via DRM timeout). Not sure
if it is possible to hit with the current job-specific captures but I
don't see why not. If you had parallel independent hangs on both GTs of
a multi-GT system then it seems like it would still go bang. Just much
harder to hit.
>
> Maybe also add helpers like...
>
> void xe_devcoredump_release(struct xe_devcoredump *coredump)
> {
> 	spin_lock(&coredump->lock);
> 	coredump->captured = false;
> 	spin_unlock(&coredump->lock);
> }
>
> bool xe_devcoredump_acquire(struct xe_devcoredump *coredump)
> {
> 	bool acquired;
>
> 	spin_lock(&coredump->lock);
> 	acquired = !coredump->captured;
> 	coredump->captured = true;
> 	spin_unlock(&coredump->lock);
>
> 	return acquired;
> }
>
> IMO - this makes this locking semantic a bit more clear.
Not convinced. There is exactly one place where each of these functions
would be used. And acquire/release doesn't seem like the right
terminology to me, especially when the release is from a totally
different call path. The 'acquire' is triggered internally by the
driver, but the 'release' is in response to a sysfs write from the
user. The intent is really just alloc and free, but for a singleton
entity, so there can only ever be one instance at a time. However, the
above code is not the actual alloc/free, so calling it that would be
wrong. The implementation is more about back-off-when-busy than regular
mutex locking. It could potentially be done with a kref and
inc-if-not-zero functionality, but that also seemed more confusing than
just having a spinlock around a simple bool.
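For illustration, the refcount version would be something along the
lines of the below (with a hypothetical 'available' field, and using
refcount_dec_if_one() rather than a literal inc-if-not-zero):

	/* 'available' would be a (hypothetical) refcount_t in struct xe_devcoredump */

	/* init: the single dump slot starts out free */
	refcount_set(&coredump->available, 1);

	/* capture path: claim the slot iff it is free (1 -> 0), else back off */
	if (!refcount_dec_if_one(&coredump->available))
		return;

	/* sysfs free path: mark the slot as free again */
	refcount_set(&coredump->available, 1);

It works, but the names no longer describe what is actually going on,
hence sticking with the spinlock around a bool.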
John.
>
>> }
>>
>> static void devcoredump_snapshot(struct xe_devcoredump *coredump,
>> @@ -310,14 +314,18 @@ void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const cha
>> struct xe_device *xe = gt_to_xe(q->gt);
>> struct xe_devcoredump *coredump = &xe->devcoredump;
>> va_list varg;
>> + bool captured;
>>
>> - if (coredump->captured) {
>> + spin_lock_irq(&coredump->lock);
>> + captured = coredump->captured;
>> + coredump->captured = true;
>> + spin_unlock_irq(&coredump->lock);
>> +
>> + if (captured) {
> Then here...
>
> 	if (!xe_devcoredump_acquire(coredump)) {
> 		msg...
> 		return;
> 	}
>
>> drm_dbg(&xe->drm, "Multiple hangs are occurring, but only the first snapshot was taken\n");
>> return;
>> }
>>
>> - coredump->captured = true;
>> -
>> va_start(varg, fmt);
>> coredump->snapshot.reason = kvasprintf(GFP_KERNEL, fmt, varg);
>> va_end(varg);
>> @@ -342,6 +350,8 @@ static void xe_driver_devcoredump_fini(void *arg)
>>
>> int xe_devcoredump_init(struct xe_device *xe)
>> {
>> + spin_lock_init(&xe->devcoredump.lock);
>> +
>> return devm_add_action_or_reset(xe->drm.dev, xe_driver_devcoredump_fini, &xe->drm);
>> }
>>
>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>> index e6234e887102..6b776abd8bf4 100644
>> --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
>> +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>> @@ -80,6 +80,8 @@ struct xe_devcoredump_snapshot {
>> * for reading the information.
>> */
>> struct xe_devcoredump {
>> + /** @lock: protects access to 'captured' to prevent concurrent captures */
>> + spinlock_t lock;
>> /** @captured: The snapshot of the first hang has already been taken. */
> Protected by @lock.
>
> Matt
>
>> bool captured;
>> /** @snapshot: Snapshot is captured at time of the first crash */
>> --
>> 2.47.0
>>