[PATCH v3 4/5] drm/xe: Wait on killed exec queues

Matthew Brost matthew.brost at intel.com
Mon Nov 4 21:02:25 UTC 2024


On Mon, Nov 04, 2024 at 11:54:04AM -0600, Lucas De Marchi wrote:
> On Mon, Nov 04, 2024 at 08:29:40AM -0800, Matthew Brost wrote:
> > On Mon, Nov 04, 2024 at 06:38:14AM -0800, Lucas De Marchi wrote:
> > > When an exec queue is killed it triggers an async process of asking the
> > > GuC to schedule the context out. The timestamp in the context image is
> > > only updated when this process completes. In case a userspace process
> > > kills an exec queue and tries to read the timestamp, it may not get an
> > > updated runtime.
> > > 
> > > Add synchronization between the process reading the fdinfo and the exec
> > > queue being killed. After reading all the timestamps, wait on exec
> > > queues in the process of being killed. When that wait is over,
> > > xe_exec_queue_fini() was already called and updated the timestamps.
> > > 
> > > v2: Do not update pending_removal before validating user args
> > >     (Matthew Auld)
> > > 
> > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > > Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
> > >  drivers/gpu/drm/xe/xe_drm_client.c   | 7 +++++++
> > >  drivers/gpu/drm/xe/xe_exec_queue.c   | 6 ++++++
> > >  3 files changed, 18 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > index cb193234c7eca..ed6b34d4a8030 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -605,6 +605,11 @@ struct xe_file {
> > >  		 * which does things while being held.
> > >  		 */
> > >  		struct mutex lock;
> > > +		/**
> > > +		 * @exec_queue.pending_removal: items pending to be removed to
> > > +		 * synchronize GPU state update with ongoing query.
> > > +		 */
> > > +		atomic_t pending_removal;
> > 
> > Would the interface in 'linux/completion.h' be better here?
> 
> I think it'd be counter-intuitive in this context:
> 
> 1) most of the time it's "done" rather than "pending", which is the
>    opposite of the completion interface - note that you can't call
>    complete_all() multiple times
> 
> 2) we'd still need a counter to support multiple queues being killed if
>    we keep it on xef. Imagine userspace did:
> 
>    	for q in queues:
> 		ioctl_wrapper(fd, DRM_XE_EXEC_QUEUE_DESTROY, ...)
> 
> Not clear how you'd do the complete_all() though. It could probably be
> done more easily if the completion were in the exec_queue, but the task
> reading the fdinfo doesn't have access to it since it's removed from
> the array
> 

Ah. When I briefly looked at this, I missed that this was a per-device
counter rather than a per-exec-queue counter. My suggestion was based on
adding a completion to the exec queue, but I see your point that it's
not available in show_run_ticks after DRM_XE_EXEC_QUEUE_DESTROY.

> did you have a different way to integrate the complete in mind?
> 

Not really. I can't say I love a global counter / wait but can't
really think of anything else. The wait is outside of holding any
locks, right? Seems fine then.

I guess my only question is: do we even care if this races and is a
little inaccurate? If the answer is that it's not a big deal for stats
to be slightly off, then maybe drop the patch. Up to you. Don't consider
any of these comments as blocking.

Matt

> after writing this I'm thinking the wait could be the first thing when
> reading the fdinfo, even before taking the force wake. That would
> synchronize well with the scenario in which the same task does the
> destroy/read-fdinfo, and for other tasks it wouldn't really matter.
> 
> Lucas De Marchi


More information about the Intel-xe mailing list