[PATCH 0/7] drm/xe: Per client usage

Tue Apr 16 18:29:55 UTC 2024

On Tue, Apr 16, 2024 at 03:22:21PM +0100, Tvrtko Ursulin wrote:
>
>On 16/04/2024 14:51, Lucas De Marchi wrote:
>>Forgot to Cc Michal, doing now.
>>
>>Lucas De Marchi
>>
>>On Tue, Apr 16, 2024 at 08:30:33AM -0500, Lucas De Marchi wrote:
>>>On Tue, Apr 16, 2024 at 09:37:44AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>>On 16/04/2024 04:04, Lucas De Marchi wrote:
>>>>>Add per-client usage statistics to xe. This ports xe to use the common
>>>>>method in drm to export the usage to userspace per client (where 1
>>>>>client == 1 drm fd open).
>>>>>
>>>>>However insted of using the current format, this creates a new one with
>>>>>the unit "ticks". The intention here is not to mix the GPU clock domain
>>>>>with the CPU clock. It allows to cover a few more use cases without
>>>>>extra complications.
>>>>>
>>>>>Last patch was a quick implemenation of a gputop-like tool in python.
>>>>>I ended doing it to cross check the gputop implementation. I's not
>>>>>really meant to be applied here.
>>>>>
>>>>>I tested this on DG2 and TGL with kmscube (console-only) and vkcube
>>>>>(in a gnome session), but it would be good to soak this under more
>>>>>tests. The biggest goal for this patch series right now is to get
>>>>>consensus on the new UAPI.
>>>>>
>>>>>TODO: Add documentation on top with the new interface.
>>>>
>>>>Yeah a drm-usage-stats.rst patch would be nice to have in the 
>>>>RFC so one does not have to look into the driver implementation 
>>>>to discuss the proposed uapi.
>>>>
>>>>Nevertheless I understand the proposal is to add this:
>>>>
>>>>drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
>>>
>>>yes, the gputop patch was more explicit about this. Should had added in
>>>the kernel patch series too.
>>>
>>>>
>>>>That's two values per key. I guess "one key value pair for per 
>>>>one line of text" does not get strictly broken and that you 
>>>>propose a heuristics in parsing to detect that the <RUNTIME> 
>>>>cannot be mis-interpreted as the unit?
>>>
>>>the current format is
>>>
>>>    drm-engine-<class>: <RUNTIME> ns
>>>
>>>the "ns" in the end should be parsed by userspace to know
>>>what it is about.
>
>Right.
>
>>>
>>>>
>>>>Not sure it is a good idea though. If you instead added a new 
>>>>key for the gpu time what would be the downside in your view? 
>>>>Like:
>>>>
>>>>drm-engine-<class>: <uint> ticks
>>>>drm-ticks-<class>: <uint>
>>>>
>>>>Or maybe even obsfuscate/generalise as:
>>>>
>>>>drm-engine-<class>: <uint> gpu-time
>>>>drm-gpu-time-<class>: <uint>
>>>
>>>I think both work, but I fail to see the advantage. This alternative is
>>>slightly heavier on the parsing side since you have to correlate the
>>>values from 2 keys, possibly dealing with them appearing in different
>>>order. The only possible breakage remains with this alternative: if
>>>userspace didn't parse the unit before. I checked nvtop and htop and
>>>they were doing the right thing. I sent a fix to igt a few weeks back
>>>for it to consider the unit:
>>>https://lore.kernel.org/igt-dev/20240405060056.59379-8-lucas.demarchi@intel.com/
>
>Advantages are that "drm-engine-something: 1234 5678 ticks" isn't 
>self-explanatory (intuitively humanly readable) and that it doesn't 

maybe I have a different expectation from procfs. When I do e.g.

# cat /proc/self/stat
3861283 (cat) R 3861233 3861283 3861231 34816 3861283 4194304 90 0 0 0 0 0 0 0 20 0 1 0 1321348797 8560640 384 18446744073709551615 93979016876032 93979016892449 140720658378704 0 0 0 0 0 0 0 0 0 17 51 0 0 0 0 0 93979016907440 93979016908904 93979037196288 140720658380605 140720658380625 140720658380625 140720658382827 0

it doesn't seem to me "intuitively humanly readable" was the first
concern for people adding files in procfs :)... I'd rather think "machine
readable" was more important.

>diverge from the one value per key plus unit format. Latter we would 
>then document clearly.
>
>Different keys potentially appearing in different order does not 
>matter since userspace already has to handle that.
>
>>>>Potentially could also add a key saying how much wall time is 
>>>>one unit of GPU time.
>>>
>>>I wouldn't add it really as it may not make sense depending on the
>>>vendor and or usage. Examples: the gpu time may be different for
>>>different engines depending on where they are located (tile/gt). The
>>>correlation with CPU time is different when running in VF mode, and may
>>>change in runtime depending on the number of VFs. +Michal.
>
>Yes, that's why I said "potentially", which was supposed to mean if 
>and where it makes sense and perhaps adds value.
>
>>>Also, if the userspace side really wants to know (why would it?)
>>>it could be just calculate from 2 samples (possibly repeated a few
>>>times as it updates the output).
>>>
>>>>
>>>>Or.. would even the existing drm-cycles, plus abuse of 
>>>>drm-maxfreq, work? Ticks == cycles, maxfreq == ticks per wall 
>>>>second.
>>>
>>>I think it'd be up to gpu vendor what clock backs this time. For the
>>>current Intel cards, it's the refclock and it doesn't vary the
>>>frequency.
>
>Right, but that doesn't matter. What I was saying is that if you 
>exposed ticks in drm-cycles and tick frequency in drm-maxfreq it would 
>even work, no? (Assuming support for those two was actually in 
>igt_drm_fdinfo/clients/gputop and could be used as fallback to time 
>based stats.)

oh... I was looking at the output for i915 and missed that we had
drm-cycles as currently i915 doesn't use it. It seems to be a similar
thing. I agree the drm-maxfreq- is unfortunate and that we don't
actually have support for that in gputop.

So, instead of the 2 numbers + different unit, I can adapt this to
rather use drm-cycles. However for maxfreq, it doesn't seem to be
what we need since it doesn't scale for VF. It brings back the cpu clock
domain this series is trying to avoid. The difference is that using
drm-cycles- and drm-maxfreq- you are expecting the userspace to do
(let me know if I interpreted the docs wrong):

	s1 = sample()
	sleep(period)
	s2 = sample()
	engine_utilization = ((s2.drm_cycles * s2.drm_max_freq) - (s1.drm_cycles * s1.drm_max_freq)) / period

... considering the drm_max_freq may change from one call to the other.
if we simplify it and assume it doesn't change:

	engine_utilization = ((s2.drm_cycles - s1.drm_cycles) * drm_max_freq) / period

we'd need different drm_max_freq reported on VF driver that would need
to know the number of VFs enabled to scaled it correctly. Maybe this is
abusing the "drm-maxfreq" a little bit?

What if we had

drm-cycles-<keystr>: <uint>
drm-total-cycles-<keystr>: <uint>

Then the utilization can be done:

	s1 = sample()
	sleep(period)
	s2 = sample()
	engine_utilization = (s2.cycles - s1.cycles) / \
			     (s2.total_cycles - s1.total_cycles + 1);

Capacity still to be added above, but we'd need to clarify if
drm-total-cycles-<keystr> already accounts for it.

Here instead of the conversion to cpu clock, I'm expecting to read
"total_cycles" from HW and that being different (slower) for VF.
AFAICS this is not the case with this current polling implementation
since we are simply reading the RING_TIMESTAMP, but there are planned
changes to get it from GuC. Umesh/Michal Cc'ed may know better.

Lucas De Marchi

>
>Only problem is the name maxfreq is not really an exact fit, which is 
>why I am not convinced we should go this route. Perhaps could add 
>drm-cycle-frequency and then drm-cycles would fit.
>
>>>>Secondly, wrap behaviour every 25-30 seconds patch 6/7 describes 
>>>>definitely breaks the format spec and in my view should be 
>>>>worked around in the driver:
>>>>
>>>>"""
>>>>Values are not required to be constantly monotonic if it makes 
>>>>the driver
>>>>implementation easier, but are required to catch up with the 
>>>>previously reported
>>>>larger value within a reasonable period. Upon observing a value 
>>>>lower than what
>>>>was previously read, userspace is expected to stay with that 
>>>>larger previous
>>>>value until a monotonic update is seen.
>>>>"""
>>>
>>>but this is the behavior for dealing with "ns", not with "ticks". We can
>
>Applies to drm-cycles- too.
>
>>>add a worker to run every so often.  When I got the original patches
>>>from Umesh, we had this worker implemented, but dealing with that was
>>>complex and wasting a lot of cycles since it was always running (except
>>>if we didn't have any exec_queue). Here is my tentative to simplify it
>>>since I don't really see the need to account for a ~25s wrap around.
>
>At the moment I don't think diverging like that would be user 
>friendly. Apart from diverging it would only make the output useful to 
>tools which know to keep polling with a small enough period.
>
>The tools with support a configurable polling period such as 
>intel_gpu_top could even inadvertently fail. Or would need to start 
>embedding knowledge about maximum periods per driver/backend/whatever. 
>Or new fdinfo keys to signal that.
>
>Sticking with the same monotic scheme sounds preferrable to me.
>
>And if wrap is 25-30 seconds why it was wasting that many cycles? Some 
>worker only needs to run once every 20 seconds or so and that is it, 
>no?

every 25-30s for every card regardless of having a userspace tool
quering or not seems wasteful to me.

Lucas De Marchi

>
>Regards,
>
>Tvrtko