[PATCH v3 4/5] drm/xe: Wait on killed exec queues

Lucas De Marchi lucas.demarchi at intel.com
Mon Nov 4 21:29:17 UTC 2024


On Mon, Nov 04, 2024 at 01:02:25PM -0800, Matthew Brost wrote:
>On Mon, Nov 04, 2024 at 11:54:04AM -0600, Lucas De Marchi wrote:
>> On Mon, Nov 04, 2024 at 08:29:40AM -0800, Matthew Brost wrote:
>> > On Mon, Nov 04, 2024 at 06:38:14AM -0800, Lucas De Marchi wrote:
>> > > When an exec queue is killed it triggers an async process of asking the
>> > > GuC to schedule the context out. The timestamp in the context image is
>> > > only updated when this process completes. In case a userspace process
>> > > kills an exec queue and tries to read the timestamp, it may not get an updated
>> > > runtime.
>> > >
>> > > Add synchronization between the process reading the fdinfo and the exec
>> > > queue being killed. After reading all the timestamps, wait on exec
>> > > queues in the process of being killed. When that wait is over,
>> > > xe_exec_queue_fini() has already been called and updated the timestamps.
>> > >
>> > > v2: Do not update pending_removal before validating user args
>> > >     (Matthew Auld)
>> > >
>> > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
>> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
>> > > Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> > > ---
>> > >  drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
>> > >  drivers/gpu/drm/xe/xe_drm_client.c   | 7 +++++++
>> > >  drivers/gpu/drm/xe/xe_exec_queue.c   | 6 ++++++
>> > >  3 files changed, 18 insertions(+)
>> > >
>> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > > index cb193234c7eca..ed6b34d4a8030 100644
>> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > > @@ -605,6 +605,11 @@ struct xe_file {
>> > >  		 * which does things while being held.
>> > >  		 */
>> > >  		struct mutex lock;
>> > > +		/**
>> > > +		 * @exec_queue.pending_removal: items pending to be removed to
>> > > +		 * synchronize GPU state update with ongoing query.
>> > > +		 */
>> > > +		atomic_t pending_removal;
>> >
>> > Would the interface in 'linux/completion.h' be better here?
>>
>> I think it'd be counter-intuitive in this context
>>
>> 1) most of the time it's "done" rather than "pending", which is the
>>    opposite of the completion interface - note that you can't call
>>    complete_all() multiple times
>>
>> 2) we'd still need a counter to support multiple queues being killed if
>>    we keep it on xef. Imagine userspace did:
>>
>>    	for q in queues:
>> 		ioctl_wrapper(fd, DRM_XE_EXEC_QUEUE_DESTROY, ...)
>>
>> Not clear how you'd do the complete_all() though. It could probably be
>> more easily done if the completion was in the exec_queue, but the task
>> reading the fdinfo doesn't have access to it since the queue is removed
>> from the array.
>>
>
>Ah. When I briefly looked at this, I missed that this was a per-device
>counter rather than a per-exec-queue counter. My suggestion was based on
>adding a completion to the exec queue, but I see your point about it not
>being available in show_run_ticks() after DRM_XE_EXEC_QUEUE_DESTROY.
>
>> did you have a different way to integrate the complete in mind?
>>
>
>Not really. I can't say I love a global counter / wait but can't
>really think of anything else. The wait is outside of holding any
>locks, right? Seems fine then.

correct

>
>I guess my only question is: do we even care if this races and is a
>little inaccurate? If the answer is that it's not a big deal if the
>stats are slightly off, then maybe drop the patch. Up to you. Don't
>consider any of these comments as blocking.

For the normal case of a top-like application, it shouldn't matter much.
If the fd is still open on the next read, it will catch up.

The main thing here is the case of a task doing a destroy + read-fdinfo
and expecting the time used by that exec queue to be accounted.

Lucas De Marchi

>
>Matt
>
>> after writing this, I'm thinking the wait could be the first thing when
>> reading the fdinfo, even before taking the forcewake. That would
>> synchronize well with the scenario in which it's the same task doing
>> the destroy/read-fdinfo, and for others it wouldn't really matter.
>>
>> Lucas De Marchi

