[PATCH 0/3] Rework work queue usage

Thu Mar 28 19:30:41 UTC 2024

On Thu, Mar 28, 2024 at 09:13:54AM -1000, htejun at gmail.com wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 02:02:47PM -0500, Lucas De Marchi wrote:
> > On Thu, Mar 28, 2024 at 11:21:44AM -0700, Matthew Brost wrote:
> > > Avoid sleeping or grabbing locks in work queues shared with the system.
> > > Recent changes to work queues [1] have exposed deadlocks [2] in Xe.
> 
> Can you elaborate it a bit? I'm having a bit of hard time imagining how the
> latest workqueue changes would have exposed deadlocks.
> 

Sure. Let me explain what is happening in the CI failure.

The test creates hundreds of exec queues that can all be preempted in
parallel. In the current code each exec queue kicks a worker scheduled
on system_unbound_wq, and these workers sleep (on a waitqueue) until
another worker signals them. That other worker, which is also scheduled
on system_unbound_wq, processes a queue which interacts with the GPU. I
suspect the worker which interacts with the hardware gets starved by
the waiters, resulting in a deadlock.

This patch changes the waiters to use a device private ordered work
queue so we have at most 1 waiter at a time. Regardless of the new work
queue behavior this is a better design.
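
To make the suspected starvation concrete, here is a rough userspace
analogy in Python. It is not kernel code and not what the patch does
literally. It assumes the shared pool behaves like a thread pool with a
fixed concurrency cap, which is a simplification of the real workqueue
rules. POOL_SIZE, WAIT_TIMEOUT and run() are made-up names, and the
timeout only exists so the script terminates instead of hanging.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4       # hypothetical cap on concurrently running shared workers
WAIT_TIMEOUT = 0.5  # seconds; stands in for "sleeps until signaled"


def run(num_waiters, private_ordered_queue):
    """Return (signaler_delay_seconds, waiters_woken_by_signal)."""
    # Models the shared pool that every worker currently lands on.
    shared = ThreadPoolExecutor(max_workers=POOL_SIZE)
    # One worker, FIFO: models a device private ordered queue.
    ordered = ThreadPoolExecutor(max_workers=1)
    done = threading.Event()
    started = []

    def waiter():
        # Sleeps until signaled; returns False if the wait timed out.
        return done.wait(timeout=WAIT_TIMEOUT)

    def signaler():
        # The worker that makes progress on behalf of all waiters.
        started.append(time.monotonic())
        done.set()

    waiter_pool = ordered if private_ordered_queue else shared
    futures = [waiter_pool.submit(waiter) for _ in range(num_waiters)]
    submitted = time.monotonic()
    shared.submit(signaler)
    woken = sum(f.result() for f in futures)
    shared.shutdown(wait=True)
    ordered.shutdown(wait=True)
    return started[0] - submitted, woken


if __name__ == "__main__":
    print("waiters on shared pool:  ", run(8, private_ordered_queue=False))
    print("waiters on ordered queue:", run(8, private_ordered_queue=True))
```

With the waiters on the shared pool, the signaler cannot start until
waiters time out and free a slot. With the waiters on their own ordered
queue, only one waiter sleeps at a time and the signaler runs right
away.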

It is beyond my knowledge if the old behavior, albeit poorly designed,
should still work with the work queue changes in 6.9.

> > I think we need some of this information in the commit message in patch
> > 1. Because patch 1 simply says it's moving to a device private wq to
> > avoid hogging the system one, but the issue is much more serious.
> > 
> > Also, is the "Fixes:"  really correct? It seems more like a regression
> > from the wq changes and there could be other drivers showing similar
> > issues now. But it could also be my lack of understanding of the real
> > issue.
> 
> I don't have enough context to tell whether this is a workqueue problem but
> if so we should definitely fix workqueue.
> 

It is beyond my knowledge if the old behavior, albeit poorly designed,
should still work with the work queue changes in 6.9.

Matt

> Thanks.
> 
> -- 
> tejun