[PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps
Lucas De Marchi
lucas.demarchi at intel.com
Sat Jan 4 07:19:59 UTC 2025
On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:
>On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:
>>On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:
>>>On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:
>>>>On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
>>>>>This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
>>>>>queue file lock usage."). While it's desirable to have the mutex
>>>>>protect only the reference to the exec queue, getting and dropping the
>>>>>mutex for each exec queue and only later reading the GPU timestamp
>>>>>doesn't produce a correct result: it introduces multiple opportunities
>>>>>for the task to be scheduled out, wreaking havoc on the deltas reported
>>>>>to userspace.
>>>>>
>>>>>Also, to better correlate the timestamps from the exec queues with the
>>>>>GPU, disable preemption so they can be updated without allowing the task
>>>>>to be scheduled out. Interrupts are left enabled, as the disturbance from
>>>>>them shouldn't be enough for the deltas to matter to userspace.
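
The change boils down to a critical section along these lines (a rough
sketch; the helper and field names below are illustrative, not the
actual xe symbols):

    /* illustrative only: helpers and fields are hypothetical */
    mutex_lock(&xef->exec_queue.lock);
    preempt_disable();                        /* keep the reads back to back */
    gpu_timestamp = read_gt_timestamp(gt);    /* hypothetical GT timestamp read */
    xa_for_each(&xef->exec_queue.xa, i, q)
            total_ticks += q->run_ticks;      /* hypothetical per-queue counter */
    preempt_enable();
    mutex_unlock(&xef->exec_queue.lock);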
>>>>
>>>>Like I said in the past, this is not trivial to solve and I
>>>>would hate to add anything in the KMD to do so.
>>>
>>>I think the best we can do on the kernel side is to try to guarantee that
>>>the correlated counters are sampled together... and that is already very
>>>good per my tests. Also, it'd not only be good from a testing
>>>perspective, but for any userspace trying to make sense of the two
>>>counters.
>>>
>>>Note that this is not much different from how e.g. perf samples group
>>>events:
>>>
>>> The unit of scheduling in perf is not an individual event, but rather an
>>> event group, which may contain one or more events (potentially on
>>> different PMUs). The notion of an event group is useful for ensuring
>>> that a set of mathematically related events are all simultaneously
>>> measured for the same period of time. For example, the number of L1
>>> cache misses should not be larger than the number of L2 cache accesses.
>>> Otherwise, it may happen that the events get multiplexed and their
>>> measurements would no longer be comparable, making the analysis more
>>> difficult.
>>>
>>>See __perf_event_read(), which calls pmu->read() on all sibling events
>>>and is invoked with preemption disabled:
>>>
>>> perf_event_read()
>>> {
>>>         ...
>>>         preempt_disable();
>>>         event_cpu = __perf_event_read_cpu(event, event_cpu);
>>>         ...
>>>         (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
>>>         preempt_enable();
>>>         ...
>>> }
>>>
>>>so... at least there's prior art for this, and for the same reason:
>>>userspace should see the values sampled together.
>>
>>Well, I have used preempt_disable/enable when fixing some
>>selftests (i915), but was not happy that there were still some rare
>>failures. If reducing the error rate is the intention, then it's fine.
>>In my mind the issue still exists, and once in a while we would end
>>up assessing such a failure. Maybe, in addition, fixing up the IGTs
>>like you suggest below is a worthwhile option.
>
>For me this fix is not targeted at tests, even if it improves them a
>lot; it's more about consistent behavior for userspace.
>
>>
>>>
>>>>
>>>>For IGT, why not just take 4 samples for the measurement
>>>>(separating out the 2 counters):
>>>>
>>>>1. get gt timestamp in the first sample
>>>>2. get run ticks in the second sample
>>>>3. get run ticks in the third sample
>>>>4. get gt timestamp in the fourth sample
>>>>
>>>>Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run
>>>>ticks delta.
>>>
>>>this won't fix it for the general case: it gets rid of the > 100% failures,
>>>but makes the < 100% ones much worse.
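
To spell it out with hypothetical helpers (only the ordering matters
here): the run-ticks window sits strictly inside the GT-timestamp
window, so any delay between samples 1-2 or 3-4 only inflates the
denominator and the computed utilization can only drop:

    u64 gt_a  = read_gt_timestamp();   /* sample 1 */
    u64 run_a = read_run_ticks();      /* sample 2 */
    /* ... workload runs ... */
    u64 run_b = read_run_ticks();      /* sample 3 */
    u64 gt_b  = read_gt_timestamp();   /* sample 4 */

    /* never above 100%, but biased low by whatever happens
     * between samples 1-2 and 3-4 */
    utilization = (run_b - run_a) * 100 / (gt_b - gt_a);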
>>
>>yeah, that's quite possible.
>>
>>>
>>>From a testing perspective I think the non-flaky solution is to stop
>>>calculating percentages and instead check that the execution time
>>>recorded by the GPU very closely matches (minus GPU scheduling delays)
>>>the one we get via fdinfo once the fence signals and we wait for the
>>>job to complete.
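
Something along these lines, with entirely hypothetical helper names
(the actual IGT change is the one linked below):

    before = read_fdinfo_run_ticks(fd, class);      /* hypothetical helper */
    submit_job_recording_own_timestamps(fd, &job);  /* batch records its start/end */
    wait_for_fence(fd, &job);                       /* job completed */
    after = read_fdinfo_run_ticks(fd, class);

    gpu_duration = job_end_timestamp(&job) - job_start_timestamp(&job);

    /* the fdinfo delta should match what the GPU itself measured, give
     * or take scheduling/submission latency, instead of comparing a
     * flaky utilization percentage */
    assert(after - before >= gpu_duration);
    assert(after - before <= gpu_duration + tolerance);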
>>
>>Agree, we should change how we validate the counters in IGT.
>
>I have a WIP patch to clean up and submit to IGT. I will submit it soon.
Just submitted that as the last patch in the series:
https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@intel.com/T/#u
but I'd also like to apply this one in the kernel and am still looking
for a review.
thanks
Lucas De Marchi