[PATCH 0/7] drm/xe: Per client usage
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Mon Apr 22 17:17:57 UTC 2024
On Mon, Apr 22, 2024 at 11:40:33AM +0100, Tvrtko Ursulin wrote:
>
>On 20/04/2024 00:51, Umesh Nerlige Ramappa wrote:
>>On Fri, Apr 19, 2024 at 11:44:46AM +0100, Tvrtko Ursulin wrote:
>>>
>>>On 18/04/2024 00:19, Lucas De Marchi wrote:
>>>>On Wed, Apr 17, 2024 at 01:35:29PM GMT, Umesh Nerlige Ramappa wrote:
>>>>>On Wed, Apr 17, 2024 at 02:05:40PM -0500, Lucas De Marchi wrote:
>>>>>>On Wed, Apr 17, 2024 at 09:51:42AM GMT, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>>On 16/04/2024 19:29, Lucas De Marchi wrote:
>>>>>>>>On Tue, Apr 16, 2024 at 03:22:21PM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>>On 16/04/2024 14:51, Lucas De Marchi wrote:
>>>>>>>>>>Forgot to Cc Michal, doing now.
>>>>>>>>>>
>>>>>>>>>>Lucas De Marchi
>>>>>>>>>>
>>>>>>>>>>On Tue, Apr 16, 2024 at 08:30:33AM -0500, Lucas De Marchi wrote:
>>>>>>>>>>>On Tue, Apr 16, 2024 at 09:37:44AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>On 16/04/2024 04:04, Lucas De Marchi wrote:
>>>>>>>>>>>>>Add per-client usage statistics to xe. This
>>>>>>>>>>>>>ports xe to use the common
>>>>>>>>>>>>>method in drm to export the usage to userspace
>>>>>>>>>>>>>per client (where 1
>>>>>>>>>>>>>client == 1 drm fd open).
>>>>>>>>>>>>>
>>>>>>>>>>>>>However instead of using the current format,
>>>>>>>>>>>>>this creates a new one with
>>>>>>>>>>>>>the unit "ticks". The intention here is not to
>>>>>>>>>>>>>mix the GPU clock domain
>>>>>>>>>>>>>with the CPU clock. It allows to cover a few
>>>>>>>>>>>>>more use cases without
>>>>>>>>>>>>>extra complications.
>>>>>>>>>>>>>
>>>>>>>>>>>>>Last patch was a quick implementation of a
>>>>>>>>>>>>>gputop-like tool in python.
>>>>>>>>>>>>>I ended up doing it to cross-check the gputop
>>>>>>>>>>>>>implementation. It's not
>>>>>>>>>>>>>really meant to be applied here.
>>>>>>>>>>>>>
>>>>>>>>>>>>>I tested this on DG2 and TGL with kmscube
>>>>>>>>>>>>>(console-only) and vkcube
>>>>>>>>>>>>>(in a gnome session), but it would be good to
>>>>>>>>>>>>>soak this under more
>>>>>>>>>>>>>tests. The biggest goal for this patch series
>>>>>>>>>>>>>right now is to get
>>>>>>>>>>>>>consensus on the new UAPI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>TODO: Add documentation on top with the new interface.
>>>>>>>>>>>>
>>>>>>>>>>>>Yeah a drm-usage-stats.rst patch would be nice
>>>>>>>>>>>>to have in the RFC so one does not have to look
>>>>>>>>>>>>into the driver implementation to discuss the
>>>>>>>>>>>>proposed uapi.
>>>>>>>>>>>>
>>>>>>>>>>>>Nevertheless I understand the proposal is to add this:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
>>>>>>>>>>>
>>>>>>>>>>>yes, the gputop patch was more explicit about
>>>>>>>>>>>this. Should have added it in
>>>>>>>>>>>the kernel patch series too.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>That's two values per key. I guess "one key
>>>>>>>>>>>>value pair per one line of text" does not
>>>>>>>>>>>>get strictly broken and that you propose a
>>>>>>>>>>>>heuristic in parsing to detect that the
>>>>>>>>>>>><RUNTIME> cannot be mis-interpreted as the unit?
>>>>>>>>>>>
>>>>>>>>>>>the current format is
>>>>>>>>>>>
>>>>>>>>>>> drm-engine-<class>: <RUNTIME> ns
>>>>>>>>>>>
>>>>>>>>>>>the "ns" in the end should be parsed by userspace to know
>>>>>>>>>>>what it is about.
>>>>>>>>>
>>>>>>>>>Right.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Not sure it is a good idea though. If you
>>>>>>>>>>>>instead added a new key for the gpu time what
>>>>>>>>>>>>would be the downside in your view? Like:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <uint> ticks
>>>>>>>>>>>>drm-ticks-<class>: <uint>
>>>>>>>>>>>>
>>>>>>>>>>>>Or maybe even obfuscate/generalise as:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <uint> gpu-time
>>>>>>>>>>>>drm-gpu-time-<class>: <uint>
>>>>>>>>>>>
>>>>>>>>>>>I think both work, but I fail to see the
>>>>>>>>>>>advantage. This alternative is
>>>>>>>>>>>slightly heavier on the parsing side since you
>>>>>>>>>>>have to correlate the
>>>>>>>>>>>values from 2 keys, possibly dealing with them
>>>>>>>>>>>appearing in different
>>>>>>>>>>>order. The only possible breakage remains with
>>>>>>>>>>>this alternative: if
>>>>>>>>>>>userspace didn't parse the unit before. I checked
>>>>>>>>>>>nvtop and htop and
>>>>>>>>>>>they were doing the right thing. I sent a fix to
>>>>>>>>>>>igt a few weeks back
>>>>>>>>>>>for it to consider the unit:
>>>>>>>>>>>https://lore.kernel.org/igt-dev/20240405060056.59379-8-lucas.demarchi@intel.com/
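The unit-suffix convention discussed above (value first, unit as the last field) can be sketched with a tiny parser. This is a hypothetical helper for illustration, not code from igt or the patches, and it only handles the existing single-value-plus-unit format:

```python
def parse_fdinfo(text):
    """Parse drm fdinfo key/value lines, keeping the trailing unit."""
    stats = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        # The last field is the unit ("ns", "ticks", ...) when it is
        # non-numeric; bare values like drm-client-id have no unit.
        if fields[-1].isdigit():
            value, unit = int(fields[-1]), None
        else:
            value, unit = int(fields[0]), fields[-1]
        stats[key.strip()] = (value, unit)
    return stats

print(parse_fdinfo("drm-engine-render:\t12345 ns"))
# {'drm-engine-render': (12345, 'ns')}
```

Userspace that ignores the unit and assumes "ns" is exactly what breaks if a new unit is introduced under the same key.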
>>>>>>>>>
>>>>>>>>>Advantages are that "drm-engine-something: 1234 5678
>>>>>>>>>ticks" isn't self-explanatory (intuitively humanly
>>>>>>>>>readable) and that it doesn't
>>>>>>>>
>>>>>>>>maybe I have a different expectation from procfs. When I do e.g.
>>>>>>>>
>>>>>>>># cat /proc/self/stat
>>>>>>>>3861283 (cat) R 3861233 3861283 3861231 34816 3861283
>>>>>>>>4194304 90 0 0 0 0 0 0 0 20 0 1 0 1321348797 8560640 384
>>>>>>>>18446744073709551615 93979016876032 93979016892449
>>>>>>>>140720658378704 0 0 0 0 0 0 0 0 0 17 51 0 0 0 0 0
>>>>>>>>93979016907440 93979016908904 93979037196288
>>>>>>>>140720658380605 140720658380625 140720658380625
>>>>>>>>140720658382827 0
>>>>>>>>
>>>>>>>>it doesn't seem to me "intuitively humanly readable" was the first
>>>>>>>>concern for people adding files in procfs :)... I'd
>>>>>>>>rather think "machine
>>>>>>>>readable" was more important.
>>>>>>>
>>>>>>>I think you are pushing the argument a bit now :) since
>>>>>>>IMO we should evaluate drm-usage-stats.rst proposal more
>>>>>>>in the context of drm-usage-stats and other fdinfo files,
>>>>>>>rather than the whole of procfs. In other words if there
>>>>>>>isn't a strong reason to regress this particular file let's
>>>>>>>not do it.
>>>>>>
>>>>>>:) I like pushing arguments if it helps revisit decisions (human vs
>>>>>>machine readable for things in procfs). I'm not
>>>>>>trying to push the 2 counter approaches though. I think other reasons
>>>>>>like discussed below are enough to consider the other keys.
>>>>>>
>>>>>>TBH I was reluctant at first to add a separate uapi rather
>>>>>>than re-using
>>>>>>drm-engine- without realizing there was already a second one (not
>>>>>>implemented in gputop).
>>>>>>
>>>>>>So AFAICS i915 and amdgpu use drm-engine-. msm and panfrost use
>>>>>>drm-cycles + drm-maxfreq. And none of them seem suitable to xe.
>>>>>>
>>>>>>>
>>>>>>>>>diverge from the one value per key plus unit format.
>>>>>>>>>Latter we would then document clearly.
>>>>>>>>>
>>>>>>>>>Different keys potentially appearing in different
>>>>>>>>>order does not matter since userspace already has to
>>>>>>>>>handle that.
>>>>>>>>>
>>>>>>>>>>>>Potentially could also add a key saying how much
>>>>>>>>>>>>wall time is one unit of GPU time.
>>>>>>>>>>>
>>>>>>>>>>>I wouldn't add it really as it may not make sense
>>>>>>>>>>>depending on the
>>>>>>>>>>>vendor and or usage. Examples: the gpu time may be different for
>>>>>>>>>>>different engines depending on where they are
>>>>>>>>>>>located (tile/gt). The
>>>>>>>>>>>correlation with CPU time is different when
>>>>>>>>>>>running in VF mode, and may
>>>>>>>>>>>change in runtime depending on the number of VFs. +Michal.
>>>>>>>>>
>>>>>>>>>Yes, that's why I said "potentially", which was
>>>>>>>>>supposed to mean if and where it makes sense and
>>>>>>>>>perhaps adds value.
>>>>>>>>>
>>>>>>>>>>>Also, if the userspace side really wants to know (why would it?)
>>>>>>>>>>>it could just be calculated from 2 samples
>>>>>>>>>>>(possibly repeated a few
>>>>>>>>>>>times as it updates the output).
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Or.. would even the existing drm-cycles, plus
>>>>>>>>>>>>abuse of drm-maxfreq, work? Ticks == cycles,
>>>>>>>>>>>>maxfreq == ticks per wall second.
>>>>>>>>>>>
>>>>>>>>>>>I think it'd be up to gpu vendor what clock backs
>>>>>>>>>>>this time. For the
>>>>>>>>>>>current Intel cards, it's the refclock and it doesn't vary the
>>>>>>>>>>>frequency.
>>>>>>>>>
>>>>>>>>>Right, but that doesn't matter. What I was saying is
>>>>>>>>>that if you exposed ticks in drm-cycles and tick
>>>>>>>>>frequency in drm-maxfreq it would even work, no?
>>>>>>>>>(Assuming support for those two was actually in
>>>>>>>>>igt_drm_fdinfo/clients/gputop and could be used as
>>>>>>>>>fallback to time based stats.)
>>>>>>>>
>>>>>>>>oh... I was looking at the output for i915 and missed that we had
>>>>>>>>drm-cycles as currently i915 doesn't use it. It seems to
>>>>>>>>be a similar
>>>>>>>>thing. I agree the drm-maxfreq- is unfortunate and that we don't
>>>>>>>>actually have support for that in gputop.
>>>>>>>>
>>>>>>>>So, instead of the 2 numbers + different unit, I can adapt this to
>>>>>>>>rather use drm-cycles. However for maxfreq, it doesn't seem to be
>>>>>>>>what we need since it doesn't scale for VF. It brings
>>>>>>>>back the cpu clock
>>>>>>>>domain this series is trying to avoid. The difference is that using
>>>>>>>>drm-cycles- and drm-maxfreq- you are expecting the userspace to do
>>>>>>>>(let me know if I interpreted the docs wrong):
>>>>>>>>
>>>>>>>> s1 = sample()
>>>>>>>> sleep(period)
>>>>>>>> s2 = sample()
>>>>>>>> engine_utilization = ((s2.drm_cycles /
>>>>>>>>s2.drm_max_freq) - (s1.drm_cycles / s1.drm_max_freq)) /
>>>>>>>>period
>>>>>>>>
>>>>>>>>... considering the drm_max_freq may change from one
>>>>>>>>call to the other.
>>>>>>>>if we simplify it and assume it doesn't change:
>>>>>>>>
>>>>>>>> engine_utilization = ((s2.drm_cycles -
>>>>>>>>s1.drm_cycles) / drm_max_freq) / period
>>>>>>>>
>>>>>>>>we'd need a different drm_max_freq reported by the VF driver,
>>>>>>>>which would need
>>>>>>>>to know the number of VFs enabled to scale it
>>>>>>>>correctly. Maybe this is
>>>>>>>>abusing the "drm-maxfreq" a little bit?
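The drm-cycles + drm-maxfreq fallback above, under the simplifying assumption that maxfreq is constant between samples, could look like this in a monitor. The dict keys mirror fdinfo key names; the 19.2MHz value and the cycle counts are invented for the example:

```python
def engine_utilization(s1, s2, period_s):
    """Busy fraction over the sampling period.

    drm-cycles counts engine-busy clock cycles and drm-maxfreq gives
    cycles per second, so delta_cycles / maxfreq is busy seconds.
    """
    delta_cycles = s2["drm-cycles-render"] - s1["drm-cycles-render"]
    maxfreq_hz = s2["drm-maxfreq-render"]  # assumed static between samples
    return (delta_cycles / maxfreq_hz) / period_s

s1 = {"drm-cycles-render": 0, "drm-maxfreq-render": 19_200_000}
s2 = {"drm-cycles-render": 9_600_000, "drm-maxfreq-render": 19_200_000}
print(f"{engine_utilization(s1, s2, 1.0):.0%}")  # 50%
```

The VF problem is visible here: for the result to stay correct, `drm-maxfreq-render` would have to shrink with the VF's quanta share, which conflicts with maxfreq being a static property.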
>>>>>>>
>>>>>>>Yes it would be bad if the observed VF GPU clock will be
>>>>>>>variable since maxfreq is supposed to be static.
>>>>>>>
>>>>>>>So on VFs the reported GPU clock would move by the VF "used" quanta?
>>>>>>
>>>>>>s/used/available/. That's my understanding, yes. Each VF has a quanta
>>>>>>and the gpu clock moves according to that quanta. Note that
>>>>>>as I said,
>>>>>>this is not the case right now (we are just reading
>>>>>>RING_TIMESTAMP), but
>>>>>>the intention is to have the UAPI side ready so it's already prepared
>>>>>>for that.
>>>>>>
>>>>>>>Where "used" is defined as time given by the GuC, not
>>>>>>>necessarily used
>>>>>>
>>>>>>s/used/available/ as above
>>>>>>
>>>>>>>GPU time. For instance 16ms quanta, VF GPU clock would
>>>>>>>move by 16ms if the GuC decides not to switch out the idle
>>>>>>>VF? Or it could move by less than 16ms if it switched it
>>>>>>>out earlier.
>>>>>>
>>>>>>no, afaiu it's 16ms, not less. But the quanta depends on the number of
>>>>>>VFs enabled, which may change in runtime.
>>>>>>
>>>>>>I'm not 100% certain and people in Cc may correct me.
>>>>>>
>>>>>>>
>>>>>>>>What if we had
>>>>>>>>
>>>>>>>>drm-cycles-<keystr>: <uint>
>>>>>>>>drm-total-cycles-<keystr>: <uint>
>>>>>>>>
>>>>>>>>Then the utilization can be done:
>>>>>>>>
>>>>>>>> s1 = sample()
>>>>>>>> sleep(period)
>>>>>>>> s2 = sample()
>>>>>>>> engine_utilization = (s2.cycles - s1.cycles) / \
>>>>>>>> (s2.total_cycles - s1.total_cycles + 1);
>>>>>>>>
>>>>>>>>Capacity still to be added above, but we'd need to clarify if
>>>>>>>>drm-total-cycles-<keystr> already accounts for it.
>>>>>>>>
>>>>>>>>Here instead of the conversion to cpu clock, I'm expecting to read
>>>>>>>>"total_cycles" from HW and that being different (slower) for VF.
>>>>>>>>AFAICS this is not the case with this current polling implementation
>>>>>>>>since we are simply reading the RING_TIMESTAMP, but
>>>>>>>>there are planned
>>>>>>>>changes to get it from GuC. Umesh/Michal Cc'ed may know better.
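As a runnable sketch of the two-counter proposal above (key names follow the proposal; the sample numbers are invented), no CPU clock enters the calculation since both counters are in the same GPU tick domain:

```python
def busyness(s1, s2):
    """Unitless utilization from two fdinfo samples."""
    busy = s2["drm-cycles-render"] - s1["drm-cycles-render"]
    total = s2["drm-total-cycles-render"] - s1["drm-total-cycles-render"]
    return busy / (total + 1)  # +1 guards against a zero delta, as above

s1 = {"drm-cycles-render": 1_000, "drm-total-cycles-render": 10_000}
s2 = {"drm-cycles-render": 6_000, "drm-total-cycles-render": 20_000}
print(f"{busyness(s1, s2):.1%}")  # roughly 50%
```

On a VF, `drm-total-cycles` advancing only by the VF's quanta is what keeps the ratio meaningful without exposing absolute time.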
>>>>>>>
>>>>>>>I think this works and is clean.
>>>>>>>
>>>>>>>Although I have some doubts about the usefulness on VFs,
>>>>>>>if the clock movements are at the mercy of the GuC
>>>>>>>scheduler. Like what does 100% mean for a VF? Maybe it was
>>>>>>>full quanta, or maybe it was half a quanta if GuC decided
>>>>>>>to switch it out early, either due going idle or due some
>>>>>>>other scheduling decision.
>>>>>>
>>>>>>in the scenario you described above the quanta could change
>>>>>>according to
>>>>>>the scheduler and 100% wouldn't mean much. That's not my
>>>>>>understanding.
>>>>>>100% always means the VF used all the allocated time. I see this line
>>>>>>potentially getting blurred a little bit if the scheduler tries to
>>>>>>maximize the HW usage and distribute quanta unevenly, but I think the
>>>>>>interface already contemplates that.
>>>>>>
>>>>>>Another case is the VF not being able to reach 100% because the PF is
>>>>>>submitting high prio work. But I still think the current interface is
>>>>>>sufficient and it's the implementation by GuC/HW that could
>>>>>>be improved
>>>>>>(e.g. adapting the gpu time reported).
>>>>>>
>>>>>>Michal / Umesh, please chime in if that is not accurate.
>>>>>>
>>>>>
>>>>>Irrespective of how much quanta a VF used, all calculations
>>>>>will be based on the quanta that it was allocated. That way
>>>>>the VF would know that it could have better utilized the
>>>>>allotted time if busyness is less than 100. This does result
>>>>>in more than 100% usage for a VF that was resource hungry and
>>>>>scheduling policies allowed it to run more than the allotted
>>>>>quanta, but this is a known limitation of the solution
>>>>>provided by GuC. When looking at the overall system (say from
>>>>>a PF), the usage should still add up to 100%.
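Toy numbers for the limitation described above: busyness is computed against the *allocated* quanta, so a VF that the scheduler allowed to overrun can report more than 100%, while the PF-wide view still sums to 100%. The millisecond values are invented:

```python
allocated_quanta_ms = 16.0  # what the VF was allotted per window
actually_ran_ms = 20.0      # scheduler let the resource-hungry VF overrun

vf_busyness = actually_ran_ms / allocated_quanta_ms
print(f"{vf_busyness:.0%}")  # 125%
```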
>>>>>
>>>>>>
>>>>>>Thinking out loud: IFF the execution quanta is available for
>>>>>>VF to query
>>>>>
>>>>>For the VF, GuC intends to provide a factor that can be used
>>>>>to scale the wall time and deduce the VF quanta. This scaled
>>>>>value is used as the second counter in a VF.
>>>>>>and we are ok with just scaling drm-maxfreq, then maybe we could even
>>>>>>just use the current interface instead of adding a third one. Although
>>>>>>it could be confusing to have that freq changing.
>>>>>
>>>>>Assuming you are talking about using the drm-cycles and
>>>>>drm-max-freq. One of the concerns when supporting VFs was that
>>>>>we cannot actually export busyness in absolute time units to
>>>>>the user because the GPU is shared across VFs. If we scale the
>>>>>busyness such that it is stretched across CPU time, then it
>>>>>helps get the right busyness % relative to CPU time, but the
>>>>>value of busyness in time units itself is false. This was the
>>>>>primary reason to use 2 "unitless" counters.
>>>>>
>>>>>fwiu, I think by using the drm-maxfreq, you are going to bring
>>>>>the same concept back in the discussion - exporting busyness
>>>>>in time units. Not sure if that's a good idea. Let me know if
>>>>>I got that wrong.
>>>>
>>>>no, but I think it would still work if we can scale the freq
>>>>according to
>>>>the quanta. But that's probably abusing the interface.
>>>>
>>>>Anyway I think we are settling on
>>>>
>>>> drm-cycles-<engineclass>
>>>> drm-total-cycles-<engineclass>
>>>>
>>>>so I will start changing the patches and igt while checking this for
>>>>more feedback if any.
>>>Another option came to mind - expose a quanta ratio as a new key.
>>>Given it has no security implications and is to be used only for
>>>calculating real VF GPU utilisation. Like:
>>>
>>>drm-engine-*: <uint> ns
>>>drm-engine-time-ratio-*: <float>
>>>
>>>Unit would be a ratio of time over quanta, for instance 1000ms /
>>>100ms = 10. Which would mean scale down reported time by 10 when
>>>comparing against wall time.
>>>
>>>New key would only appear on VFs. Otherwise assumed 1.
>>>
>>>Or could be avoided per engine and just have single global:
>>>
>>>drm-gpu-time-ratio: <float>
>>>
>>>Am I missing something or could that work? It would have the same
>>>"could go over 100%" problem mentioned above. I mean
>>>this comment:
>>
>>I am hesitant to expose the quanta ratio at this level. We get that
>>from GuC and that interface could potentially change. If that
>>happens, I'd prefer that the uApi is unaffected.
>
>FWIW this idea was simply a time scale factor and I don't think it
>should have a connection to any GuC implementation details. In other
>words the observable result of time*scale/elapsed-time vs
>ticks/total-ticks should be the same. At least with the semantics that
>were discussed in this thread.
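To make the time*scale/elapsed-time semantics concrete: a VF granted 100ms of quanta per 1000ms of wall time would expose a ratio of 10, and a monitor multiplies observed engine time by the ratio before dividing by elapsed wall time. The delta and ratio values below are invented:

```python
engine_time_delta_ns = 80_000_000   # delta of drm-engine-render between samples
elapsed_wall_ns = 1_000_000_000     # sampling period
gpu_time_ratio = 10.0               # 1000ms wall / 100ms quanta

# time * scale / elapsed-time, equivalent to ticks / total-ticks
utilization = engine_time_delta_ns * gpu_time_ratio / elapsed_wall_ns
print(f"{utilization:.0%}")  # 80% of the VF's quanta was used
```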
>
For the drm client busyness that we are discussing here, the VF-specific
interface is still WIP. The quanta ratio is more of an engine busyness
concept (see below) that I presume will be used here as well.
>>>
>>>"""
>>>Irrespective of how much quanta a VF used, all calculations will be
>>>based on the quanta that it was allocated.
>>>...
>>>This does result in more than 100% usage for a VF that
>>>was resource hungry and scheduling policies allowed it to run more
>>>than the allotted quanta,
>>>"""
>>>
>>>I read that as total-ticks would never be reported as above the
>>>configured quanta - always equal.
>>
>>Correct
>>
>>>Maybe I misunderstood?
>>>
>>>Second topic - are there any plans to allow the PF to monitor VF
>>>GPU utilisation? That wouldn't work via fdinfo aggregation since
>>>VF clients will not be visible in a PF. But it sounds like a
>>>basic and important use case.
>>
>>wip. Engine utilization will be available per-VF from PF.
>
>How it will be exposed out of curiosity?
This would be supported only at the engine level granularity. Client
level, like you mention, is not possible.
Like here, the % engine busyness is exposed as 2 counters -
engine_runtime (how long the engine ran) and total_ticks (how long the
PF/VF ran). Internally GuC provides the engine_runtime in ticks and the KMD
derives total_ticks from the quanta_ratio and the elapsed CPU time.
That information is available to the PF as an array of functions that can
be indexed using the function index.
A global busyness is also available (aggregate of all functions).
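A hedged sketch of the total_ticks derivation described above; the 19.2MHz reference clock, the function name, and the 25% quanta share are assumptions for illustration, and the real computation lives in the KMD:

```python
REFCLOCK_HZ = 19_200_000  # assumed GPU timestamp frequency

def total_ticks(elapsed_cpu_ns, quanta_ratio):
    """Ticks the function could have run: wall time scaled by quanta share."""
    return elapsed_cpu_ns * quanta_ratio * REFCLOCK_HZ // 1_000_000_000

# A VF with a 25% quanta share, sampled over 1 second of wall time:
print(total_ticks(1_000_000_000, 0.25))  # 4800000.0 ticks available
```

Per-function busyness would then be engine_runtime / total_ticks, mirroring the drm-cycles / drm-total-cycles split at the client level.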
Regarding the KMD<->UMD interface:
1) The 2 counters are exported via the PMU interface.
2) The function index will be part of the config bitmap used with
perf_event_open.
I don't know yet if we will end up using the kernel perf interface for
this on xe (separate topic though).
Regards,
Umesh
>
>Regards,
>
>Tvrtko