[PATCH] drm/xe: Fix missing runtime outer protection for ggtt_remove_node
Rodrigo Vivi
rodrigo.vivi at intel.com
Fri May 31 16:31:34 UTC 2024
On Fri, May 31, 2024 at 04:15:31PM +0000, Matthew Brost wrote:
> On Fri, May 31, 2024 at 12:02:05PM -0400, Rodrigo Vivi wrote:
> > Defer the ggtt node removal to a thread if runtime_pm is not active.
> >
> > The ggtt node removal can be called from multiple places, including
> > paths where the callers cannot take an outer runtime-PM reference and
> > paths where other locks are already held. So, grab the runtime
> > reference if the device is already active; otherwise defer the removal
> > to a worker from which we can safely wake the device up.
> >
> > Cc: Paulo Zanoni <paulo.r.zanoni at intel.com>
> > Cc: Francois Dugast <francois.dugast at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_ggtt.c | 56 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 51 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> > index b01a670fecb8..d63bf1a744b5 100644
> > --- a/drivers/gpu/drm/xe/xe_ggtt.c
> > +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> > @@ -443,16 +443,14 @@ int xe_ggtt_insert_bo(struct xe_ggtt *ggtt, struct xe_bo *bo)
> > return __xe_ggtt_insert_bo_at(ggtt, bo, 0, U64_MAX);
> > }
> >
> > -void xe_ggtt_remove_node(struct xe_ggtt *ggtt, struct drm_mm_node *node,
> > - bool invalidate)
> > +static void ggtt_remove_node(struct xe_ggtt *ggtt, struct drm_mm_node *node,
> > + bool invalidate)
> > {
> > struct xe_device *xe = tile_to_xe(ggtt->tile);
> > bool bound;
> > int idx;
> >
> > bound = drm_dev_enter(&xe->drm, &idx);
> > - if (bound)
> > - xe_pm_runtime_get_noresume(xe);
> >
> > mutex_lock(&ggtt->lock);
> > if (bound)
> > @@ -467,10 +465,58 @@ void xe_ggtt_remove_node(struct xe_ggtt *ggtt, struct drm_mm_node *node,
> > if (invalidate)
> > xe_ggtt_invalidate(ggtt);
> >
> > - xe_pm_runtime_put(xe);
> > drm_dev_exit(idx);
> > }
> >
> > +struct remove_node_work {
> > + struct work_struct work;
> > + struct xe_ggtt *ggtt;
> > + struct drm_mm_node *node;
> > + bool invalidate;
> > +};
> > +
> > +static void ggtt_remove_node_work_func(struct work_struct *work)
> > +{
> > + struct remove_node_work *remove_node = container_of(work, struct remove_node_work, work);
> > + struct xe_device *xe = tile_to_xe(remove_node->ggtt->tile);
> > +
> > + xe_pm_runtime_get(xe);
> > + ggtt_remove_node(remove_node->ggtt, remove_node->node, remove_node->invalidate);
> > + xe_pm_runtime_put(xe);
> > +
> > + kfree(remove_node);
> > +}
> > +
> > +static void ggtt_queue_remove_node(struct xe_ggtt *ggtt, struct drm_mm_node *node,
> > + bool invalidate)
> > +{
> > + struct remove_node_work *remove_node;
> > +
> > + remove_node = kmalloc(sizeof(*remove_node), GFP_KERNEL);
>
> Are we sure this code cannot be in an atomic context or in the path of a
> dma-fence? If either of the former is true, then we cannot allocate
> memory here.
Not sure, to be honest.
> Alternatively, we could use GFP_ATOMIC or preallocate
> 'remove_node_work' as part of the initial GGTT node allocation. The
> latter requires a bit more memory, but GGTT allocations are heavyweight
> objects, and using a bit more memory seems fine to me.
I had thought about simply going with GFP_ATOMIC.
The pre-allocation doesn't work unless we encapsulate the drm_mm_node
in an xe_mm_node that carries the removal info.
> Also if we do the
> latter, maybe just add the node to a list and kick a dedicated work item
> which processes all nodes on the list.
The list with the single worker also sounds like an elegant solution here,
processing all the removals in the same way. But for simplicity, if
GFP_ATOMIC works, I would prefer to go with that approach, since it
minimizes the threading and keeps a 1:1 work:node relationship.
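For reference, a rough and untested sketch of what the list + single-worker
variant could look like. Everything here is hypothetical: the xe_mm_node
wrapper, the deferred_* fields in struct xe_ggtt, and the function names are
illustrative, not part of the actual patch:

```c
/*
 * Hypothetical sketch: embed the removal info and a list link in an
 * xe_mm_node wrapper so no allocation is needed at removal time, and
 * drain the list from a single worker that holds the runtime-PM ref.
 */
struct xe_mm_node {
	struct drm_mm_node base;
	struct list_head deferred_link;
	bool invalidate;
};

static void ggtt_deferred_remove_work_func(struct work_struct *work)
{
	struct xe_ggtt *ggtt = container_of(work, struct xe_ggtt,
					    deferred_remove_work);
	struct xe_device *xe = tile_to_xe(ggtt->tile);
	struct xe_mm_node *node, *tmp;
	LIST_HEAD(removals);

	/* Splice the pending nodes under the lock, then process unlocked */
	spin_lock(&ggtt->deferred_lock);
	list_splice_init(&ggtt->deferred_removals, &removals);
	spin_unlock(&ggtt->deferred_lock);

	xe_pm_runtime_get(xe);
	list_for_each_entry_safe(node, tmp, &removals, deferred_link) {
		list_del(&node->deferred_link);
		ggtt_remove_node(ggtt, &node->base, node->invalidate);
	}
	xe_pm_runtime_put(xe);
}
```

The enqueue side would then only need a spin_lock-protected list_add_tail()
plus a queue_work(), both safe from atomic context.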
>
> > + if (!remove_node)
> > + return;
> > +
> > + INIT_WORK(&remove_node->work, ggtt_remove_node_work_func);
> > + remove_node->ggtt = ggtt;
> > + remove_node->node = node;
> > + remove_node->invalidate = invalidate;
> > +
> > + queue_work(system_unbound_wq, &remove_node->work);
>
> I think we need to be careful with system wq usage. Recently we have had
> two bugs [1][2] exposed in 6.9 in which we deadlocked by using system
> wqs. I think it is likely safer to use a driver-dedicated queue here.
Ouch! It's probably good to create a dedicated wq for xe_ggtt so we don't
interfere with anything else.
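Something as simple as this should do for the dedicated wq (the name,
flags, and init placement below are only illustrative, not tested):

```c
/*
 * Hypothetical: a driver-owned workqueue created at GGTT init time,
 * used instead of system_unbound_wq for the deferred removals.
 */
static int xe_ggtt_init_wq(struct xe_ggtt *ggtt)
{
	ggtt->wq = alloc_workqueue("xe-ggtt-wq", 0, 0);
	return ggtt->wq ? 0 : -ENOMEM;
}
```

with queue_work(ggtt->wq, &remove_node->work) in ggtt_queue_remove_node()
and a matching destroy_workqueue() on fini.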
>
> Other than these questions, the design of the patch (try grabbing a PM
> reference; if we can't, defer to a worker) LGTM.
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/133210/
> [2] https://patchwork.freedesktop.org/patch/586095/?series=131904&rev=1
>
> > +}
> > +
> > +void xe_ggtt_remove_node(struct xe_ggtt *ggtt, struct drm_mm_node *node,
> > + bool invalidate)
> > +{
> > + struct xe_device *xe = tile_to_xe(ggtt->tile);
> > +
> > + if (xe_pm_runtime_get_if_active(xe)) {
> > + ggtt_remove_node(ggtt, node, invalidate);
> > + xe_pm_runtime_put(xe);
> > + } else {
> > + ggtt_queue_remove_node(ggtt, node, invalidate);
> > + }
> > +}
> > +
> > void xe_ggtt_remove_bo(struct xe_ggtt *ggtt, struct xe_bo *bo)
> > {
> > if (XE_WARN_ON(!bo->ggtt_node.size))
> > --
> > 2.45.1
> >