[PATCH v3 4/5] drm/xe: Wait on killed exec queues
Lucas De Marchi
lucas.demarchi at intel.com
Mon Nov 4 21:29:17 UTC 2024
On Mon, Nov 04, 2024 at 01:02:25PM -0800, Matthew Brost wrote:
>On Mon, Nov 04, 2024 at 11:54:04AM -0600, Lucas De Marchi wrote:
>> On Mon, Nov 04, 2024 at 08:29:40AM -0800, Matthew Brost wrote:
>> > On Mon, Nov 04, 2024 at 06:38:14AM -0800, Lucas De Marchi wrote:
>> > > When an exec queue is killed it triggers an async process of asking the
>> > > GuC to schedule the context out. The timestamp in the context image is
>> > > only updated when this process completes. In case a userspace process
>> > > kills an exec and tries to read the timestamp, it may not get an updated
>> > > runtime.
>> > >
>> > > Add synchronization between the process reading the fdinfo and the exec
>> > > queue being killed. After reading all the timestamps, wait on exec
>> > > queues in the process of being killed. When that wait is over,
>> > > xe_exec_queue_fini() was already called and updated the timestamps.
>> > >
>> > > v2: Do not update pending_removal before validating user args
>> > > (Matthew Auld)
>> > >
>> > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
>> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
>> > > Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> > > ---
>> > > drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
>> > > drivers/gpu/drm/xe/xe_drm_client.c | 7 +++++++
>> > > drivers/gpu/drm/xe/xe_exec_queue.c | 6 ++++++
>> > > 3 files changed, 18 insertions(+)
>> > >
>> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > > index cb193234c7eca..ed6b34d4a8030 100644
>> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > > @@ -605,6 +605,11 @@ struct xe_file {
>> > > * which does things while being held.
>> > > */
>> > > struct mutex lock;
>> > > + /**
>> > > + * @exec_queue.pending_removal: items pending to be removed to
>> > > + * synchronize GPU state update with ongoing query.
>> > > + */
>> > > + atomic_t pending_removal;
>> >
>> > Would the interface in 'linux/completion.h' be better here?
>>
>> I think it'd be counter-intuitive in this context
>>
>> 1) most of the time it's "done" rather than "pending", which is the opposite
>> of the completion interface - note that you can't call complete_all()
>> multiple times
>>
>> 2) we'd still need a counter to support multiple queues being killed if
>> we keep it on xef. Imagine userspace did:
>>
>> for q in queues:
>> ioctl_wrapper(fd, DRM_XE_EXEC_QUEUE_DESTROY, ...)
>>
>> Not clear how you'd do the complete_all() thought. Could probably be
>> more easily done if the completion was in the exec_queue, but the task
>> reading the fdinfo doesn't have access since it's removed from the array
>>
>
>Ah. When I briefly looked at this, I missed this was a per device
>counter rather than per exec queue counter. My suugestion was based on
>adding a completion to the exec queue which I see your point of not
>being available in show_run_ticks after DRM_XE_EXEC_QUEUE_DESTROY.
>
>> did you have a different way to integrate the complete in mind?
>>
>
>Not really. I can't say I love a global counter / wait but can't
>really think of anything else. The wait is outside of holding any
>locks, right? Seems fine then.
correct
>
>I guess my only question is do we even care if races and is a little
>inaccurate? If the answer is, not a big deal if stats are slightly off
>then maybe drop the patch. Up to you. Don't consider any of these
>comments as blocking.
for the normal case of a top-like application, it shouldn't matter much.
If the fd is still open next read, it will catch up.
The main thing here is if we have a task doing a destroy + read fdinfo
and expecting the time used by that exec queue to be accounted.
Lucas De Marchi
>
>Matt
>
>> after writing this I'm thinking the wait could be the first thing when
>> reading the fdinfo, even before taking the force wake. That would
>> synchronize well with the scenarion in which it's the same task doing
>> the destroy/read-fdinfo and for others it wouldn't really matter.
>>
>> Lucas De Marchi
More information about the Intel-xe
mailing list