[PATCH] drm/xe/pxp: Don't kill queues while holding the spinlock

Thu Feb 13 20:19:46 UTC 2025

On Thu, Feb 13, 2025 at 09:23:26AM -0800, Daniele Ceraolo Spurio wrote:
> 
> 
> On 2/12/2025 10:42 PM, Dan Carpenter wrote:
> > On Wed, Feb 12, 2025 at 05:26:55PM -0800, Matthew Brost wrote:
> > > On Wed, Feb 12, 2025 at 04:40:32PM -0800, Daniele Ceraolo Spurio wrote:
> > > > xe_exec_queue_kill can sleep, so we can't call it from under the lock.
> > > > We can instead move the queues to a separate list and then kill them all
> > > > after we release the lock.
> > > > 
> > > > Since being in the list is used to track whether RPM cleanup is needed,
> > > > we can no longer defer that to queue_destroy, so we perform it
> > > > immediately instead.
> > > > 
> > > > Reported-by: Dan Carpenter <dan.carpenter at linaro.org>
> > > > Fixes: f8caa80154c4 ("drm/xe/pxp: Add PXP queue tracking and session start")
> > > > Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> > > Patch LGTM but can this actually happen though? i.e. Can or do we enable
> > > PXP on LR queues?
> > > 
> > This isn't really an answer to your question, but when I reported this
> > bug I didn't notice the if (xe_vm_in_preempt_fence_mode()) check in
> > xe_vm_remove_compute_exec_queue().  So it's possible that this was a
> > false positive?
> 
> We currently don't have a use-case where we need a vm in preempt_fence_mode
> for a queue that uses PXP, but I didn't block the combination because there
> is a chance we might want to use it in the future (compute PXP is supported
> by the HW, even if we don't currently support it in Xe), so a user can still
> set things up that way.
> 
> > 
> > > Also as a follow should be add a might_sleep() to xe_exec_queue_kill to
> > > catch this type of bug immediately?
> > There is a might_sleep() in down_write().  If this is a real bug that
> > would have caught it.  The problem is that people don't generally test
> > with CONFIG_DEBUG_ATOMIC_SLEEP so the might_sleep() calls are turned off.
> 
> We do have CONFIG_DEBUG_ATOMIC_SLEEP enabled in CI (and I have it locally
> since I use the CI config), but since PXP + preempt_fence_mode is not an
> expected use-case we don't have any tests that cover that combination, so we
> return early from that xe_vm_remove_compute_exec_queue() and don't hit the
> down_write/might_sleep. I'll see if I can add a test to cover it, as there
> might be other issues I've missed.
> Also, I don't think it'd be right to add a might_sleep at the top of the
> exec_queue_kill() function either, because if a caller is sure that
> xe_vm_in_preempt_fence_mode() is false they should be allowed to
> call exec_queue_kill() from atomic context.

I see what you are saying here but allowing something 'like we know we
not preempt queue so it is safe to kill in an atomic conetxt' seems
risky and a very odd thing to support. IMO we just make it clear that
this function can't be called in an atomic context.

We likely have some upcoming TLB invalidation changes too which I think
will move all queues to a per VM list with the list being protected by a
sleeping lock. Removal from this list should likely be done in kill too.
This is speculation however.

I agree some test cases for preempt queues and PXP would be a good idea
if this isn't explictly disallowed at the IOCTL level.

Matt

> 
> Thanks,
> Daniele
> 
> > 
> > regards,
> > dan carpenter
> > 
>