[PATCH v3 4/5] drm/xe: Wait on killed exec queues
Matthew Brost
matthew.brost at intel.com
Mon Nov 4 21:02:25 UTC 2024
On Mon, Nov 04, 2024 at 11:54:04AM -0600, Lucas De Marchi wrote:
> On Mon, Nov 04, 2024 at 08:29:40AM -0800, Matthew Brost wrote:
> > On Mon, Nov 04, 2024 at 06:38:14AM -0800, Lucas De Marchi wrote:
> > > When an exec queue is killed, it triggers an async process of asking
> > > the GuC to schedule the context out. The timestamp in the context image
> > > is only updated when this process completes. If a userspace process
> > > kills an exec queue and then tries to read the timestamp, it may not
> > > get an updated runtime.
> > >
> > > Add synchronization between the process reading the fdinfo and the exec
> > > queue being killed. After reading all the timestamps, wait on the exec
> > > queues that are in the process of being killed. Once that wait completes,
> > > xe_exec_queue_fini() has already been called and updated the timestamps.
> > >
> > > v2: Do not update pending_removal before validating user args
> > > (Matthew Auld)
> > >
> > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > > Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
> > > drivers/gpu/drm/xe/xe_drm_client.c | 7 +++++++
> > > drivers/gpu/drm/xe/xe_exec_queue.c | 6 ++++++
> > > 3 files changed, 18 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > index cb193234c7eca..ed6b34d4a8030 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -605,6 +605,11 @@ struct xe_file {
> > > * which does things while being held.
> > > */
> > > struct mutex lock;
> > > + /**
> > > + * @exec_queue.pending_removal: number of exec queues pending removal,
> > > + * used to synchronize GPU state updates with an ongoing query.
> > > + */
> > > + atomic_t pending_removal;
> >
> > Would the interface in 'linux/completion.h' be better here?
>
> I think it'd be counter-intuitive in this context
>
> 1) most of the time it's "done" rather than "pending", which is the opposite
> of the completion interface - note that you can't call complete_all()
> multiple times
>
> 2) we'd still need a counter to support multiple queues being killed if
> we keep it on xef. Imagine userspace did:
>
> for q in queues:
> ioctl_wrapper(fd, DRM_XE_EXEC_QUEUE_DESTROY, ...)
>
> Not clear how you'd do the complete_all(), though. It could probably be
> done more easily if the completion were in the exec_queue, but the task
> reading the fdinfo can't access it since the queue is already removed
> from the array
>
Ah. When I briefly looked at this, I missed that this is a per-file
counter rather than a per-exec-queue counter. My suggestion was based on
adding a completion to the exec queue, and I see your point that it
wouldn't be available in show_run_ticks() after DRM_XE_EXEC_QUEUE_DESTROY.
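For the archive, a rough sketch of what I had in mind -- hypothetical
only, the kill_done field below doesn't exist anywhere in the patch:

	/* hypothetical field in struct xe_exec_queue, initialized with
	 * init_completion() at queue creation */
	struct completion kill_done;

	/* kill path: signal once the context is scheduled out and the
	 * timestamp in the context image is final */
	complete_all(&q->kill_done);

	/* fdinfo path: this is where the wait would go, but by then the
	 * queue is already removed from the array, so there is no q to
	 * take &q->kill_done from -- the problem you describe above */
	wait_for_completion(&q->kill_done);

So yeah, the per-file counter looks like the simpler option.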
> did you have a different way to integrate the complete in mind?
>
Not really. I can't say I love a global counter / wait, but I can't
think of anything better. The wait happens outside of holding any
locks, right? Seems fine then.
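For anyone reading along later, I assume the counter usage ends up
roughly like the below -- just a sketch on my end, since the
xe_drm_client.c and xe_exec_queue.c hunks aren't quoted in this reply:

	/* destroy ioctl, after validating user args (the v2 change) */
	atomic_inc(&xef->exec_queue.pending_removal);

	/* exec queue fini, once the timestamps are final */
	if (atomic_dec_and_test(&xef->exec_queue.pending_removal))
		wake_up_var(&xef->exec_queue.pending_removal);

	/* show_run_ticks(), after reading all the timestamps */
	wait_var_event(&xef->exec_queue.pending_removal,
		       !atomic_read(&xef->exec_queue.pending_removal));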
I guess my only question is: do we even care if this races and the
result is a little inaccurate? If it's not a big deal for the stats to
be slightly off, then maybe drop the patch. Up to you. Don't consider
any of these comments blocking.
Matt
> after writing this I'm thinking the wait could be the first thing when
> reading the fdinfo, even before taking the forcewake. That would
> synchronize well with the scenario in which the same task does the
> destroy/read-fdinfo, and for others it wouldn't really matter.
>
> Lucas De Marchi
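A sketch of that ordering, for illustration -- assuming the current
show_run_ticks() structure in xe_drm_client.c, which isn't quoted here:

	static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
	{
		struct xe_file *xef = file->driver_priv;

		/* flush pending exec queue kills before anything else,
		 * including taking forcewake */
		wait_var_event(&xef->exec_queue.pending_removal,
			       !atomic_read(&xef->exec_queue.pending_removal));

		/* ... then take forcewake and accumulate timestamps ... */
	}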