[RFC v1 3/9] drm/xe/hw_engine_group: Register hw engine group's exec queues
Matthew Brost
matthew.brost at intel.com
Thu Aug 15 14:55:25 UTC 2024
On Tue, Aug 13, 2024 at 02:24:27PM +0200, Thomas Hellström wrote:
> On Wed, 2024-07-17 at 20:09 +0000, Matthew Brost wrote:
> > On Wed, Jul 17, 2024 at 03:07:24PM +0200, Francois Dugast wrote:
> > > Add helpers to safely add and delete the exec queues attached to a
> > > hw engine group, and make use of them at the time of creation and
> > > destruction of the exec queues. Keeping track of them is required
> > > to control the execution mode of the hw engine group.
> > >
> >
> > Missed a few more things...
> >
> > > Signed-off-by: Francois Dugast <francois.dugast at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_exec_queue.c | 7 +++++
> > > drivers/gpu/drm/xe/xe_hw_engine.c | 45 ++++++++++++++++++++++++++++++
> > > drivers/gpu/drm/xe/xe_hw_engine.h | 4 +++
> > > 3 files changed, 56 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 0ba37835849b..645423a63434 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -192,6 +192,9 @@ void xe_exec_queue_destroy(struct kref *ref)
> > > xe_exec_queue_put(eq);
> > > }
> > >
> > > + if (q->vm)
> >
> > I think this code path can be called for GSC exec queues, which
> > don't have a q->hwe->hw_engine_group. So I think:
> >
> > 	if (q->vm && q->hwe->hw_engine_group)
> > 		xe_hw_engine_group_del_exec_queue(q->hwe->hw_engine_group, q);
> >
> > > + xe_hw_engine_group_del_exec_queue(q->hwe->hw_engine_group, q);
> > > +
> > > q->ops->fini(q);
> > > }
> > >
> > > @@ -640,6 +643,10 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
> > > if (XE_IOCTL_DBG(xe, err))
> > > goto put_exec_queue;
> > > }
> > > +
> > > + err = xe_hw_engine_group_add_exec_queue(q->hwe->hw_engine_group, q);
> >
> > So this is only called on non-VM-bind exec queues. I think
> > technically this is wrong, as VM bind exec queues can use dma-fences
> > plus the copy engine. I plan on dropping the copy engine for these
> > [1], and I don't think it is worth fixing the VM bind path with mode
> > switching if we are only going to drop this requirement soon. Let me
> > just respin [1] and hopefully we can prioritize those reviews so
> > these two series land at roughly the same time.
> >
> > [1] https://patchwork.freedesktop.org/patch/582003/?series=125608&rev=5
>
>
> Matt, Francois
>
> I have expressed concerns a couple of times about ripping out the
> capability of gpu binding, mainly for two reasons.
>
> 1) I think a pretty common use-case for a bo is
> a) clearing/readback
> b) binding
>
> If we can have those executed by the same exec_queue then the binding's
In the current design, all clearing/readback is done by a kernel
migrate queue. Only kernel binds (page faults, rebinds in the exec
IOCTL, and rebinds in the preempt worker) are done on this queue.

All user binds go on a per-VM queue (or a user can create multiple
queues per VM; the use case is VK sparse binding). Thus, regardless of
whether the GPU or CPU is used for user binds, if a user bind depends
on a BO move, the bind job is held in the DRM scheduler until the BO
move is complete. Once a job is ready in the scheduler, using the CPU
will certainly have lower latency than the GPU. Lastly, in general I
have seen that it is fairly easy to overwhelm the GuC by context
switching lots of jobs, so any way we can relieve that pressure is
also a win.

Changing the queue usage model would be pretty invasive, and sharing
the kernel migrate queue with user binds could create scenarios where
kernel operations are stuck behind unsignaled user input fences. That
might even deadlock; I'd have to think it through.
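
The "held in the DRM scheduler" part is just the generic drm_sched
dependency mechanism. A minimal sketch, illustrative only and not the
actual Xe code path (the function and fence names here are made up):

	#include <drm/gpu_scheduler.h>
	#include <linux/dma-fence.h>

	/*
	 * Make the bind job wait on the BO move fence: the scheduler
	 * will not run the job until the dependency has signaled.
	 * drm_sched_job_add_dependency() consumes a fence reference,
	 * hence the dma_fence_get().
	 */
	static int bind_job_wait_for_move(struct drm_sched_job *job,
					  struct dma_fence *move_fence)
	{
		return drm_sched_job_add_dependency(job, dma_fence_get(move_fence));
	}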
> dependencies are implicitly resolved and no latency incurred. If
> executed by the cpu, we have to add the latency of signalling the
> clearing fence and waking up the executing thread.
>
> 2) Considering using page-table bos out of non-mappable VRAM.
>
> It's entirely possible these concerns were considered non-valid at some
> point but in case they weren't, just a reminder.
>
We'd have to rewrite our binding code anyway if page tables are in
non-mappable VRAM. The CPU is always used to program the private part
of the page table entries. If dependencies exist, the GPU is used to
prune the shared page tables via a job. Our code doesn't support using
the GPU to write the private part; that would basically be a rewrite.
> How difficult would it be to perform the switching also on these VM
> binds?
Not following this part.
Matt
>
> /Thomas
>
> >
> > > + if (err)
> > > + goto put_exec_queue;
> > > }
> > >
> > > mutex_lock(&xef->exec_queue.lock);
> > > diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > index f8df85d25617..4dcc885a55c8 100644
> > > --- a/drivers/gpu/drm/xe/xe_hw_engine.c
> > > +++ b/drivers/gpu/drm/xe/xe_hw_engine.c
> > > @@ -1178,3 +1178,48 @@ u64 xe_hw_engine_read_timestamp(struct xe_hw_engine *hwe)
> > > {
> > > return xe_mmio_read64_2x32(hwe->gt, RING_TIMESTAMP(hwe->mmio_base));
> > > }
> > > +
> > > +/**
> > > + * xe_hw_engine_group_add_exec_queue() - Add an exec queue to a hw engine group
> > > + * @group: The hw engine group
> > > + * @q: The exec_queue
> > > + *
> > > + * Return: 0 on success,
> > > + * -EINTR if the lock could not be acquired
> > > + */
> > > +int xe_hw_engine_group_add_exec_queue(struct xe_hw_engine_group *group, struct xe_exec_queue *q)
> > > +{
> > > + int err;
> > > +
> >
> > Also assert here that this is not a VM bind queue (e.g.
> > EXEC_QUEUE_FLAG_VM is clear).
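> >
> > Something like this, perhaps (just a sketch; assuming xe_assert()
> > and gt_to_xe(q->gt) are usable in this context):
> >
> > 	struct xe_device *xe = gt_to_xe(q->gt);
> >
> > 	xe_assert(xe, !(q->flags & EXEC_QUEUE_FLAG_VM));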
> >
> > > + err = down_write_killable(&group->mode_sem);
> > > + if (err)
> > > + return err;
> > > +
> > > + list_add(&q->hw_engine_group_link, &group->exec_queue_list);
> >
> > I don't see where INIT_LIST_HEAD is called on
> > group->exec_queue_list. I think that should unconditionally be done
> > in __xe_exec_queue_alloc.
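> >
> > i.e., something like this (a sketch; exactly where each init
> > belongs is an open question):
> >
> > 	/* per-queue link, e.g. in __xe_exec_queue_alloc() */
> > 	INIT_LIST_HEAD(&q->hw_engine_group_link);
> >
> > 	/* per-group list head, wherever the group itself is set up */
> > 	INIT_LIST_HEAD(&group->exec_queue_list);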
> >
> > Matt
> >
> > > + up_write(&group->mode_sem);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_hw_engine_group_del_exec_queue() - Delete an exec queue from a hw engine group
> > > + * @group: The hw engine group
> > > + * @q: The exec_queue
> > > + *
> > > + * Return: 0 on success,
> > > + * -EINTR if the lock could not be acquired
> > > + */
> > > +int xe_hw_engine_group_del_exec_queue(struct xe_hw_engine_group *group, struct xe_exec_queue *q)
> > > +{
> > > + int err;
> > > +
> > > + err = down_write_killable(&group->mode_sem);
> > > + if (err)
> > > + return err;
> > > +
> > > + if (!list_empty(&group->exec_queue_list))
> > > + list_del(&q->hw_engine_group_link);
> > > + up_write(&group->mode_sem);
> > > +
> > > + return 0;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_hw_engine.h b/drivers/gpu/drm/xe/xe_hw_engine.h
> > > index 7f2d27c0ba1a..ce59d83a75ad 100644
> > > --- a/drivers/gpu/drm/xe/xe_hw_engine.h
> > > +++ b/drivers/gpu/drm/xe/xe_hw_engine.h
> > > @@ -9,6 +9,7 @@
> > > #include "xe_hw_engine_types.h"
> > >
> > > struct drm_printer;
> > > +struct xe_exec_queue;
> > >
> > > #ifdef CONFIG_DRM_XE_JOB_TIMEOUT_MIN
> > > #define XE_HW_ENGINE_JOB_TIMEOUT_MIN CONFIG_DRM_XE_JOB_TIMEOUT_MIN
> > > @@ -70,4 +71,7 @@ static inline bool xe_hw_engine_is_valid(struct xe_hw_engine *hwe)
> > > const char *xe_hw_engine_class_to_str(enum xe_engine_class class);
> > > u64 xe_hw_engine_read_timestamp(struct xe_hw_engine *hwe);
> > >
> > > +int xe_hw_engine_group_add_exec_queue(struct xe_hw_engine_group *group, struct xe_exec_queue *q);
> > > +int xe_hw_engine_group_del_exec_queue(struct xe_hw_engine_group *group, struct xe_exec_queue *q);
> > > +
> > > #endif
> > > --
> > > 2.43.0
> > >
>