[PATCH] drm/xe/pxp: Don't kill queues while holding the spinlock

Daniele Ceraolo Spurio daniele.ceraolospurio at intel.com
Wed Feb 19 21:33:47 UTC 2025



On 2/18/2025 7:20 PM, Matthew Brost wrote:
> On Tue, Feb 18, 2025 at 07:18:30PM -0800, Matthew Brost wrote:
>> On Tue, Feb 18, 2025 at 04:38:34PM -0800, Daniele Ceraolo Spurio wrote:
>>>
>>> On 2/13/2025 12:19 PM, Matthew Brost wrote:
>>>> On Thu, Feb 13, 2025 at 09:23:26AM -0800, Daniele Ceraolo Spurio wrote:
>>>>> On 2/12/2025 10:42 PM, Dan Carpenter wrote:
>>>>>> On Wed, Feb 12, 2025 at 05:26:55PM -0800, Matthew Brost wrote:
>>>>>>> On Wed, Feb 12, 2025 at 04:40:32PM -0800, Daniele Ceraolo Spurio wrote:
>>>>>>>> xe_exec_queue_kill can sleep, so we can't call it from under the lock.
>>>>>>>> We can instead move the queues to a separate list and then kill them all
>>>>>>>> after we release the lock.
>>>>>>>>
>>>>>>>> Since being in the list is used to track whether RPM cleanup is needed,
>>>>>>>> we can no longer defer that to queue_destroy, so we perform it
>>>>>>>> immediately instead.
>>>>>>>>
>>>>>>>> Reported-by: Dan Carpenter <dan.carpenter at linaro.org>
>>>>>>>> Fixes: f8caa80154c4 ("drm/xe/pxp: Add PXP queue tracking and session start")
>>>>>>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
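For context, the pattern the commit message describes looks roughly like the
sketch below. This is only an illustration of the approach, not the actual
patch: the helper name and the pxp->queues.* / pxp->xe fields are
approximations from this discussion, while xe_exec_queue_kill() and
xe_pm_runtime_put() are existing Xe interfaces.

static void pxp_kill_queues(struct xe_pxp *pxp)
{
        struct xe_exec_queue *q, *next;
        LIST_HEAD(to_kill);

        /* Move everything off the tracked list while holding the spinlock. */
        spin_lock(&pxp->queues.lock);
        list_splice_init(&pxp->queues.list, &to_kill);
        spin_unlock(&pxp->queues.lock);

        /*
         * With the lock dropped, the (possibly sleeping) kill is safe. The
         * queues are no longer on the tracked list, so the RPM reference is
         * released here instead of being deferred to queue_destroy.
         */
        list_for_each_entry_safe(q, next, &to_kill, pxp.link) {
                list_del_init(&q->pxp.link);
                xe_exec_queue_kill(q);
                xe_pm_runtime_put(pxp->xe);
        }
}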
>>>>>>> Patch LGTM, but can this actually happen? i.e., can or do we enable
>>>>>>> PXP on LR queues?
>>>>>>>
>>>>>> This isn't really an answer to your question, but when I reported this
>>>>>> bug I didn't notice the if (xe_vm_in_preempt_fence_mode()) check in
>>>>>> xe_vm_remove_compute_exec_queue().  So it's possible that this was a
>>>>>> false positive?
>>>>> We currently don't have a use-case where we need a vm in preempt_fence_mode
>>>>> for a queue that uses PXP, but I didn't block the combination because there
>>>>> is a chance we might want to use it in the future (compute PXP is supported
>>>>> by the HW, even if we don't currently support it in Xe), so a user can still
>>>>> set things up that way.
>>>>>
>>>>>>> Also, as a follow-up, should we add a might_sleep() to xe_exec_queue_kill to
>>>>>>> catch this type of bug immediately?
>>>>>> There is a might_sleep() in down_write().  If this is a real bug, that
>>>>>> would have caught it.  The problem is that people don't generally test
>>>>>> with CONFIG_DEBUG_ATOMIC_SLEEP, so the might_sleep() calls are turned off.
>>>>> We do have CONFIG_DEBUG_ATOMIC_SLEEP enabled in CI (and I have it locally
>>>>> since I use the CI config), but since PXP + preempt_fence_mode is not an
>>>>> expected use-case we don't have any tests that cover that combination, so we
>>>>> return early from xe_vm_remove_compute_exec_queue() and don't hit the
>>>>> down_write/might_sleep. I'll see if I can add a test to cover it, as there
>>>>> might be other issues I've missed.
>>>>> Also, I don't think it'd be right to add a might_sleep at the top of the
>>>>> exec_queue_kill() function either, because if a caller is sure that
>>>>> xe_vm_in_preempt_fence_mode() is false they should be allowed to
>>>>> call exec_queue_kill() from atomic context.
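For reference, the shape of the path being discussed, paraphrased from this
thread rather than copied from the source (the locked section is elided):

void xe_vm_remove_compute_exec_queue(struct xe_vm *vm, struct xe_exec_queue *q)
{
        /* The check Dan mentioned: non-preempt-fence VMs bail out early ... */
        if (!xe_vm_in_preempt_fence_mode(vm))
                return;

        /* ... so this down_write(), and its might_sleep(), is never reached. */
        down_write(&vm->lock);
        /* ... unlink the queue from the VM's preempt-fence tracking ... */
        up_write(&vm->lock);
}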
>>>> I see what you are saying here, but allowing something like 'we know this
>>>> is not a preempt queue, so it is safe to kill it in an atomic context' seems
>>>> risky and a very odd thing to support. IMO we should just make it clear that
>>>> this function can't be called in an atomic context.
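For illustration, the follow-up suggested above would amount to something like
this (a sketch, not a real patch):

void xe_exec_queue_kill(struct xe_exec_queue *q)
{
        /*
         * Document that this function may sleep regardless of whether the VM
         * is in preempt-fence mode, so CONFIG_DEBUG_ATOMIC_SLEEP flags an
         * atomic-context caller immediately instead of only when the
         * preempt-fence path and its down_write() are actually taken.
         */
        might_sleep();

        /* ... existing kill logic unchanged ... */
}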
>>>>
>>>> We likely have some upcoming TLB invalidation changes too, which I think
>>>> will move all queues to a per-VM list, with the list being protected by a
>>>> sleeping lock. Removal from that list should likely be done in kill too.
>>>> This is speculation, however.
>>>>
>>>> I agree some test cases for preempt queues and PXP would be a good idea
>>>> if this isn't explicitly disallowed at the IOCTL level.
>>> I have a test written locally and I've managed to repro the atomic sleep
>>> issue. Unfortunately, I have found a second issue with a locking inversion:
>>> we take pxp->mutex under the vm lock when we create an exec_queue that uses
>>> PXP, while here we do the opposite with the kill() taking the vm lock under
>>> pxp mutex. Not sure yet which of the 2 sides is easier to fix and therefore
>>> if this patch needs an update, so I'll hold the merge for now until I have a
>>> clearer idea.
>>>
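To spell the inversion out, the two orderings look roughly like this
(simplified, illustrative functions, not the actual Xe code):

/* exec_queue creation with PXP: pxp->mutex is taken while vm->lock is held. */
static int create_side(struct xe_vm *vm, struct xe_pxp *pxp)
{
        down_write(&vm->lock);          /* A ... */
        mutex_lock(&pxp->mutex);        /* ... then B */
        /* ... attach the queue to the PXP session ... */
        mutex_unlock(&pxp->mutex);
        up_write(&vm->lock);
        return 0;
}

/* PXP invalidation: the kill ends up taking vm->lock under pxp->mutex. */
static void invalidation_side(struct xe_pxp *pxp, struct xe_exec_queue *q)
{
        mutex_lock(&pxp->mutex);        /* B ... */
        xe_exec_queue_kill(q);          /* ... then A, via down_write(&vm->lock)
                                         * in xe_vm_remove_compute_exec_queue()
                                         */
        mutex_unlock(&pxp->mutex);
}

Taking A then B on one side and B then A on the other is the classic AB-BA
pattern lockdep will complain about, so one of the two sides has to change
its ordering.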
>> I'd suggest moving q->pxp.link to pxp->kill.list + a worker to do the kill
>> then. The kill is already async, so it's likely not a big deal to make it
>> more async.
>>
> Or if we want to make this generic... Add xe_exec_queue_kill_async with
> the same idea as above, but in a non-PXP-specific way.
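In case it helps, a rough sketch of that idea in its PXP-specific form; a
generic xe_exec_queue_kill_async() would look similar but live outside PXP.
The pxp->kill.* fields and both function names here are hypothetical, and
reference handling is reduced to the bare minimum:

static void pxp_kill_work_fn(struct work_struct *w)
{
        struct xe_pxp *pxp = container_of(w, struct xe_pxp, kill.work);
        struct xe_exec_queue *q, *next;
        LIST_HEAD(list);

        spin_lock(&pxp->kill.lock);
        list_splice_init(&pxp->kill.list, &list);
        spin_unlock(&pxp->kill.lock);

        /* The worker can sleep, so the actual kill happens here. */
        list_for_each_entry_safe(q, next, &list, pxp.link) {
                list_del_init(&q->pxp.link);
                xe_exec_queue_kill(q);
                xe_exec_queue_put(q);   /* drop the ref taken when queued */
        }
}

/* Callable from atomic context; the sleeping part is deferred to the worker. */
static void pxp_queue_kill_async(struct xe_pxp *pxp, struct xe_exec_queue *q)
{
        xe_exec_queue_get(q);

        spin_lock(&pxp->kill.lock);
        list_move_tail(&q->pxp.link, &pxp->kill.list);
        spin_unlock(&pxp->kill.lock);

        queue_work(system_unbound_wq, &pxp->kill.work);
}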

I've considered that, but I think I can more easily just move the kill()
out of the pxp->mutex and remove the problem that way. I'm testing that
solution now and will post it later if I don't hit any errors.

Daniele

>
> Matt
>
>> Matt
>>
>>> Daniele
>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Daniele
>>>>>
>>>>>> regards,
>>>>>> dan carpenter
>>>>>>


