[PATCH 2/2] drm/xe: Add lock around devcoredump capture status

Thu Nov 21 19:18:22 UTC 2024

On 11/21/2024 09:24, Matthew Brost wrote:
> On Thu, Nov 21, 2024 at 08:56:19AM -0800, John Harrison wrote:
>> On 11/20/2024 19:30, Matthew Brost wrote:
>>> On Wed, Nov 20, 2024 at 03:22:57PM -0800, John.C.Harrison at Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>
>>>> There are now multiple places that can trigger a coredump. Some of
>>>> which can happen in parallel. There is already a check against
>>>> capturing multiple dumps sequentially, but without locking it doesn't
>>>> guarantee to work against concurrent dumps. And if two dumps do happen
>>>> in parallel, they can end up doing Bad Things such as one call stack
>>>> freeing the data the other call stack is still processing. Which leads
>>>> to a crashed kernel. So add the lock.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_devcoredump.c       | 18 ++++++++++++++----
>>>>    drivers/gpu/drm/xe/xe_devcoredump_types.h |  2 ++
>>>>    2 files changed, 16 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> index b5aebe8fb06d..05f3c9185ef3 100644
>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>> @@ -235,9 +235,13 @@ static void xe_devcoredump_free(void *data)
>>>>    	/* To prevent stale data on next snapshot, clear everything */
>>>>    	memset(&coredump->snapshot, 0, sizeof(coredump->snapshot));
>>>> -	coredump->captured = false;
>>>> +
>>>>    	drm_info(&coredump_to_xe(coredump)->drm,
>>>>    		 "Xe device coredump has been deleted.\n");
>>>> +
>>>> +	spin_lock_irq(&coredump->lock);
>>>> +	coredump->captured = false;
>>>> +	spin_unlock_irq(&coredump->lock);
>>> I don't think you need the IRQ version here, or in the other usage. I
>>> don't believe in either case here we are in IRQ contexts, or if we were
>>> one of the cases would use spin_lock (already in an IRQ context) and
>>> the other would use spin_lock_irq (not in an IRQ context but another
>>> case can be). Certainly taking spin_lock_irq in all use cases of a lock
>>> seems unnecessary.
>> Yeah, that was a copy/paste that I was meaning to go back and update as
>> necessary but evidently forgot to.
>>
>>> Is there some follow on patches where this changes?
>> There will be. Still more re-work to be done for the GT targetted capture
>> support.
>>
>>> Adding the lock seems right though and should enable coredumps on
>>> multi-GT or from which are not the GT ordered WQ.
>> Yeah, with the capture on GT reset I was hitting the race and got a very
>> dead kernel a couple of times (because there were parallel captures for the
>> same failure - one via GT reset and one via DRM timeout). Not sure if it is
>> possible to hit with the current job specific captures but I don't see why
>> not. If you had parallel independent hangs on both GTs of a multi-GT system
>> then it seems like it would still go bang. Just much harder to hit.
>>
> Agree this is certainly possible to hit.
>
>>> Maybe also add helpers like...
>>>
>>> void xe_devcoredump_release()
>>> {
>>> 	spin_lock(&coredump->lock);
>>> 	coredump->captured = false;
>>> 	spin_unlock(&coredump->lock);
>>> }
>>>
>>> bool xe_devcoredump_acquire();
>>> {
>>> 	bool acquired;
>>>
>>> 	spin_lock(&coredump->lock);
>>> 	acquired = !coredump->captured;
>>> 	coredump->captured = true;
>>> 	spin_unlock(&coredump->lock);
>>>
>>> 	return acquired;
>>> }
>>>
>>> IMO - this makes this locking semantic a bit more clear.
>> Not convinced. There are exactly one places where each of these functions
>> would be used. And acquire/release doesn't seem like the right terminology
>> to me. Especially when the release is from a totally different call path.
>> The 'acquire' is an internally triggered thing by the driver but the
>> 'release' is in response to a sysfs write from the user. The intent is
>> really just alloc and free but for a singleton entity so there can only ever
>> be one instance at a time.  However, the above code is not the actual
>> alloc/free so calling it that would be wrong.
>>
> This comment is more about having a locking semantic that protects code
> paths rather than data. I think having a helper makes this part more
> palatable.
The intent is that the locking is around the 'captured' state - 
basically to provide atomic get-and-set support. With an atomic 
get-and-set, the code can ensure only one call path attempts the initial 
allocation. And then everything else is supposed to be safe because the 
content never changes after allocation until it is freed. But yeah, 
there is a separate race on free...

>
> However, now that I've written this, maybe it's better to just use a
> single large mutex that is held across xe_devcoredump,
> xe_devcoredump_free, and xe_devcoredump_read. Looking at the code, I'm
> pretty sure that xe_devcoredump_free and xe_devcoredump_read can race,
> with xe_devcoredump_read possibly dereferencing a NULL pointer in
> read.buffer.
Because a user could be reading the capture out via debugfs right when 
the DRM timer expires and expunges the capture asynchrously? Yeah, that 
could be a problem. Or are you seeing another race on top of that?

>   This mutex would also need to be primed with reclaim, so I
> guess it wouldn't be held in flush_work(&ss->work) or
> cancel_work_sync(&coredump->snapshot.work), but everywhere else in the
> aforementioned functions.
Not following about reclaim. Are you meaning because the shrinker might 
want to free the coredump in order to reclaim memory? Does 'primed with 
reclaim' mean it is or is not allowed to be acquired inside the reclaim 
paths?

>
> Do you think that would work? If so, it seems cleaner and would adhere
> to Sima's locking guidelines [1].
As above, I'm not sure what the reclaim connection is or what the 
implication of that are. But having a big-coredump-mutex-lock sounds 
like it should work to resolve both the double allocation issue and the 
free while reading issue. Only concern would be if any of the core dump 
trigger points are in places where a mutex lock is not allowed. I don't 
think they are though. The capture from the dead CT handler is 
potentially in atomic context but the intent is that will skip the top 
level xe->devcoredump stuff and just jump in at the 
xe_devcoredump_snapshot level, i.e. it has it's own private snapshot 
object that is internally allocated, printed and freed. So it would 
bypass the mutex lock (and also bypass the double allocation and free vs 
read problems).

John.

>
> Matt
>
> [1] https://blog.ffwll.ch/2022/07/locking-engineering.html
>
>> The implementation is more about back-off-when-busy than regular mutex
>> locking. It could potentially be done with a kref thing and inc-if-not-zero
>> functionality, but that also seemed more confusion than just having a
>> spinlock around a simple bool.
>>
>> John.
>>
>>>>    }
>>>>    static void devcoredump_snapshot(struct xe_devcoredump *coredump,
>>>> @@ -310,14 +314,18 @@ void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const cha
>>>>    	struct xe_device *xe = gt_to_xe(q->gt);
>>>>    	struct xe_devcoredump *coredump = &xe->devcoredump;
>>>>    	va_list varg;
>>>> +	bool captured;
>>>> -	if (coredump->captured) {
>>>> +	spin_lock_irq(&coredump->lock);
>>>> +	captured = coredump->captured;
>>>> +	coredump->captured = true;
>>>> +	spin_unlock_irq(&coredump->lock);
>>>> +
>>>> +	if (captured) {
>>> Then here...
>>>
>>> if (!xe_devcoredump_acquire())
>>> 	msg...
>>> 	return;
>>>
>>>>    		drm_dbg(&xe->drm, "Multiple hangs are occurring, but only the first snapshot was taken\n");
>>>>    		return;
>>>>    	}
>>>> -	coredump->captured = true;
>>>> -
>>>>    	va_start(varg, fmt);
>>>>    	coredump->snapshot.reason = kvasprintf(GFP_KERNEL, fmt, varg);
>>>>    	va_end(varg);
>>>> @@ -342,6 +350,8 @@ static void xe_driver_devcoredump_fini(void *arg)
>>>>    int xe_devcoredump_init(struct xe_device *xe)
>>>>    {
>>>> +	spin_lock_init(&xe->devcoredump.lock);
>>>> +
>>>>    	return devm_add_action_or_reset(xe->drm.dev, xe_driver_devcoredump_fini, &xe->drm);
>>>>    }
>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>> index e6234e887102..6b776abd8bf4 100644
>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
>>>> @@ -80,6 +80,8 @@ struct xe_devcoredump_snapshot {
>>>>     * for reading the information.
>>>>     */
>>>>    struct xe_devcoredump {
>>>> +	/** @lock: protects access to 'captured' to prevent concurrent captures */
>>>> +	spinlock_t lock;
>>>>    	/** @captured: The snapshot of the first hang has already been taken. */
>>> Protected by @lock.
>>>
>>> Matt
>>>
>>>>    	bool captured;
>>>>    	/** @snapshot: Snapshot is captured at time of the first crash */
>>>> -- 
>>>> 2.47.0
>>>>