[PATCH 0/7] drm/xe: Per client usage
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Mon Apr 22 17:17:57 UTC 2024
On Mon, Apr 22, 2024 at 11:40:33AM +0100, Tvrtko Ursulin wrote:
>
>On 20/04/2024 00:51, Umesh Nerlige Ramappa wrote:
>>On Fri, Apr 19, 2024 at 11:44:46AM +0100, Tvrtko Ursulin wrote:
>>>
>>>On 18/04/2024 00:19, Lucas De Marchi wrote:
>>>>On Wed, Apr 17, 2024 at 01:35:29PM GMT, Umesh Nerlige Ramappa wrote:
>>>>>On Wed, Apr 17, 2024 at 02:05:40PM -0500, Lucas De Marchi wrote:
>>>>>>On Wed, Apr 17, 2024 at 09:51:42AM GMT, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>>On 16/04/2024 19:29, Lucas De Marchi wrote:
>>>>>>>>On Tue, Apr 16, 2024 at 03:22:21PM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>>On 16/04/2024 14:51, Lucas De Marchi wrote:
>>>>>>>>>>Forgot to Cc Michal, doing now.
>>>>>>>>>>
>>>>>>>>>>Lucas De Marchi
>>>>>>>>>>
>>>>>>>>>>On Tue, Apr 16, 2024 at 08:30:33AM -0500, Lucas De Marchi wrote:
>>>>>>>>>>>On Tue, Apr 16, 2024 at 09:37:44AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>On 16/04/2024 04:04, Lucas De Marchi wrote:
>>>>>>>>>>>>>Add per-client usage statistics to xe. This
>>>>>>>>>>>>>ports xe to use the common
>>>>>>>>>>>>>method in drm to export the usage to userspace
>>>>>>>>>>>>>per client (where 1
>>>>>>>>>>>>>client == 1 drm fd open).
>>>>>>>>>>>>>
>>>>>>>>>>>>>However instead of using the current format,
>>>>>>>>>>>>>this creates a new one with
>>>>>>>>>>>>>the unit "ticks". The intention here is not to
>>>>>>>>>>>>>mix the GPU clock domain
>>>>>>>>>>>>>with the CPU clock. It allows to cover a few
>>>>>>>>>>>>>more use cases without
>>>>>>>>>>>>>extra complications.
>>>>>>>>>>>>>
>>>>>>>>>>>>>Last patch was a quick implementation of a
>>>>>>>>>>>>>gputop-like tool in python.
>>>>>>>>>>>>>I ended up doing it to cross-check the gputop
>>>>>>>>>>>>>implementation. It's not
>>>>>>>>>>>>>really meant to be applied here.
>>>>>>>>>>>>>
>>>>>>>>>>>>>I tested this on DG2 and TGL with kmscube
>>>>>>>>>>>>>(console-only) and vkcube
>>>>>>>>>>>>>(in a gnome session), but it would be good to
>>>>>>>>>>>>>soak this under more
>>>>>>>>>>>>>tests. The biggest goal for this patch series
>>>>>>>>>>>>>right now is to get
>>>>>>>>>>>>>consensus on the new UAPI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>TODO: Add documentation on top with the new interface.
>>>>>>>>>>>>
>>>>>>>>>>>>Yeah a drm-usage-stats.rst patch would be nice
>>>>>>>>>>>>to have in the RFC so one does not have to look
>>>>>>>>>>>>into the driver implementation to discuss the
>>>>>>>>>>>>proposed uapi.
>>>>>>>>>>>>
>>>>>>>>>>>>Nevertheless I understand the proposal is to add this:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
>>>>>>>>>>>
>>>>>>>>>>>yes, the gputop patch was more explicit about
>>>>>>>>>>>this. Should have added it in
>>>>>>>>>>>the kernel patch series too.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>That's two values per key. I guess "one key
>>>>>>>>>>>>value pair per one line of text" does not
>>>>>>>>>>>>get strictly broken and that you propose a
>>>>>>>>>>>>heuristic in parsing to detect that the
>>>>>>>>>>>><RUNTIME> cannot be mis-interpreted as the unit?
>>>>>>>>>>>
>>>>>>>>>>>the current format is
>>>>>>>>>>>
>>>>>>>>>>> drm-engine-<class>: <RUNTIME> ns
>>>>>>>>>>>
>>>>>>>>>>>the "ns" in the end should be parsed by userspace to know
>>>>>>>>>>>what it is about.
>>>>>>>>>
>>>>>>>>>Right.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Not sure it is a good idea though. If you
>>>>>>>>>>>>instead added a new key for the gpu time what
>>>>>>>>>>>>would be the downside in your view? Like:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <uint> ticks
>>>>>>>>>>>>drm-ticks-<class>: <uint>
>>>>>>>>>>>>
>>>>>>>>>>>>Or maybe even obfuscate/generalise as:
>>>>>>>>>>>>
>>>>>>>>>>>>drm-engine-<class>: <uint> gpu-time
>>>>>>>>>>>>drm-gpu-time-<class>: <uint>
>>>>>>>>>>>
>>>>>>>>>>>I think both work, but I fail to see the
>>>>>>>>>>>advantage. This alternative is
>>>>>>>>>>>slightly heavier on the parsing side since you
>>>>>>>>>>>have to correlate the
>>>>>>>>>>>values from 2 keys, possibly dealing with them
>>>>>>>>>>>appearing in different
>>>>>>>>>>>order. The only possible breakage remains with
>>>>>>>>>>>this alternative: if
>>>>>>>>>>>userspace didn't parse the unit before. I checked
>>>>>>>>>>>nvtop and htop and
>>>>>>>>>>>they were doing the right thing. I sent a fix to
>>>>>>>>>>>igt a few weeks back
>>>>>>>>>>>for it to consider the unit:
>>>>>>>>>>>https://lore.kernel.org/igt-dev/20240405060056.59379-8-lucas.demarchi@intel.com/
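The unit-suffix convention discussed above (value first, unit as the last field) can be sketched with a tiny parser. This is a hypothetical helper for illustration, not code from igt or the patches, and it only handles the existing single-value-plus-unit format:

```python
def parse_fdinfo(text):
    """Parse drm fdinfo key/value lines, keeping the trailing unit."""
    stats = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        # The last field is the unit ("ns", "ticks", ...) when it is
        # non-numeric; bare values like drm-client-id have no unit.
        if fields[-1].isdigit():
            value, unit = int(fields[-1]), None
        else:
            value, unit = int(fields[0]), fields[-1]
        stats[key.strip()] = (value, unit)
    return stats

print(parse_fdinfo("drm-engine-render:\t12345 ns"))
# {'drm-engine-render': (12345, 'ns')}
```

Userspace that ignores the unit and assumes "ns" is exactly what breaks if a new unit is introduced under the same key.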
>>>>>>>>>
>>>>>>>>>Advantages are that "drm-engine-something: 1234 5678
>>>>>>>>>ticks" isn't self-explanatory (intuitively humanly
>>>>>>>>>readable) and that it doesn't
>>>>>>>>
>>>>>>>>maybe I have a different expectation from procfs. When I do e.g.
>>>>>>>>
>>>>>>>># cat /proc/self/stat
>>>>>>>>3861283 (cat) R 3861233 3861283 3861231 34816 3861283
>>>>>>>>4194304 90 0 0 0 0 0 0 0 20 0 1 0 1321348797 8560640 384
>>>>>>>>18446744073709551615 93979016876032 93979016892449
>>>>>>>>140720658378704 0 0 0 0 0 0 0 0 0 17 51 0 0 0 0 0
>>>>>>>>93979016907440 93979016908904 93979037196288
>>>>>>>>140720658380605 140720658380625 140720658380625
>>>>>>>>140720658382827 0
>>>>>>>>
>>>>>>>>it doesn't seem to me "intuitively humanly readable" was the first
>>>>>>>>concern for people adding files in procfs :)... I'd
>>>>>>>>rather think "machine
>>>>>>>>readable" was more important.
>>>>>>>
>>>>>>>I think you are pushing the argument a bit now :) since
>>>>>>>IMO we should evaluate drm-usage-stats.rst proposal more
>>>>>>>in the context of drm-usage-stats and other fdinfo files,
>>>>>>>rather than the whole of procfs. In other words if there
>>>>>>>isn't a strong reason to regress this particular file let's
>>>>>>>not do it.
>>>>>>
>>>>>>:) I like pushing arguments if it helps revisit decisions (human vs
>>>>>>machine readable for things in procfs). I'm not
>>>>>>trying to push the 2 counter approaches though. I think other reasons
>>>>>>like discussed below are enough to consider the other keys.
>>>>>>
>>>>>>TBH I was reluctant at first to add a separate uapi rather
>>>>>>than re-using
>>>>>>drm-engine- without realizing there was already a second one (not
>>>>>>implemented in gputop).
>>>>>>
>>>>>>So AFAICS i915 and amdgpu use drm-engine-. msm and panfrost use
>>>>>>drm-cycles + drm-maxfreq. And none of them seem suitable to xe.
>>>>>>
>>>>>>>
>>>>>>>>>diverge from the one value per key plus unit format.
>>>>>>>>>Latter we would then document clearly.
>>>>>>>>>
>>>>>>>>>Different keys potentially appearing in different
>>>>>>>>>order does not matter since userspace already has to
>>>>>>>>>handle that.
>>>>>>>>>
>>>>>>>>>>>>Potentially could also add a key saying how much
>>>>>>>>>>>>wall time is one unit of GPU time.
>>>>>>>>>>>
>>>>>>>>>>>I wouldn't add it really as it may not make sense
>>>>>>>>>>>depending on the
>>>>>>>>>>>vendor and or usage. Examples: the gpu time may be different for
>>>>>>>>>>>different engines depending on where they are
>>>>>>>>>>>located (tile/gt). The
>>>>>>>>>>>correlation with CPU time is different when
>>>>>>>>>>>running in VF mode, and may
>>>>>>>>>>>change in runtime depending on the number of VFs. +Michal.
>>>>>>>>>
>>>>>>>>>Yes, that's why I said "potentially", which was
>>>>>>>>>supposed to mean if and where it makes sense and
>>>>>>>>>perhaps adds value.
>>>>>>>>>
>>>>>>>>>>>Also, if the userspace side really wants to know (why would it?)
>>>>>>>>>>>it could just be calculated from 2 samples
>>>>>>>>>>>(possibly repeated a few
>>>>>>>>>>>times as it updates the output).
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Or.. would even the existing drm-cycles, plus
>>>>>>>>>>>>abuse of drm-maxfreq, work? Ticks == cycles,
>>>>>>>>>>>>maxfreq == ticks per wall second.
>>>>>>>>>>>
>>>>>>>>>>>I think it'd be up to gpu vendor what clock backs
>>>>>>>>>>>this time. For the
>>>>>>>>>>>current Intel cards, it's the refclock and it doesn't vary the
>>>>>>>>>>>frequency.
>>>>>>>>>
>>>>>>>>>Right, but that doesn't matter. What I was saying is
>>>>>>>>>that if you exposed ticks in drm-cycles and tick
>>>>>>>>>frequency in drm-maxfreq it would even work, no?
>>>>>>>>>(Assuming support for those two was actually in
>>>>>>>>>igt_drm_fdinfo/clients/gputop and could be used as
>>>>>>>>>fallback to time based stats.)
>>>>>>>>
>>>>>>>>oh... I was looking at the output for i915 and missed that we had
>>>>>>>>drm-cycles as currently i915 doesn't use it. It seems to
>>>>>>>>be a similar
>>>>>>>>thing. I agree the drm-maxfreq- is unfortunate and that we don't
>>>>>>>>actually have support for that in gputop.
>>>>>>>>
>>>>>>>>So, instead of the 2 numbers + different unit, I can adapt this to
>>>>>>>>rather use drm-cycles. However for maxfreq, it doesn't seem to be
>>>>>>>>what we need since it doesn't scale for VF. It brings
>>>>>>>>back the cpu clock
>>>>>>>>domain this series is trying to avoid. The difference is that using
>>>>>>>>drm-cycles- and drm-maxfreq- you are expecting the userspace to do
>>>>>>>>(let me know if I interpreted the docs wrong):
>>>>>>>>
>>>>>>>> s1 = sample()
>>>>>>>> sleep(period)
>>>>>>>> s2 = sample()
>>>>>>>> engine_utilization = ((s2.drm_cycles /
>>>>>>>>s2.drm_max_freq) - (s1.drm_cycles / s1.drm_max_freq)) /
>>>>>>>>period
>>>>>>>>
>>>>>>>>... considering the drm_max_freq may change from one
>>>>>>>>call to the other.
>>>>>>>>if we simplify it and assume it doesn't change:
>>>>>>>>
>>>>>>>> engine_utilization = ((s2.drm_cycles -
>>>>>>>>s1.drm_cycles) / drm_max_freq) / period
>>>>>>>>
>>>>>>>>we'd need a different drm_max_freq reported by the VF driver,
>>>>>>>>which would need
>>>>>>>>to know the number of VFs enabled to scale it
>>>>>>>>correctly. Maybe this is
>>>>>>>>abusing the "drm-maxfreq" a little bit?
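The drm-cycles + drm-maxfreq fallback above, under the simplifying assumption that maxfreq is constant between samples, could look like this in a monitor. The dict keys mirror fdinfo key names; the 19.2MHz value and the cycle counts are invented for the example:

```python
def engine_utilization(s1, s2, period_s):
    """Busy fraction over the sampling period.

    drm-cycles counts engine-busy clock cycles and drm-maxfreq gives
    cycles per second, so delta_cycles / maxfreq is busy seconds.
    """
    delta_cycles = s2["drm-cycles-render"] - s1["drm-cycles-render"]
    maxfreq_hz = s2["drm-maxfreq-render"]  # assumed static between samples
    return (delta_cycles / maxfreq_hz) / period_s

s1 = {"drm-cycles-render": 0, "drm-maxfreq-render": 19_200_000}
s2 = {"drm-cycles-render": 9_600_000, "drm-maxfreq-render": 19_200_000}
print(f"{engine_utilization(s1, s2, 1.0):.0%}")  # 50%
```

The VF problem is visible here: for the result to stay correct, `drm-maxfreq-render` would have to shrink with the VF's quanta share, which conflicts with maxfreq being a static property.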
>>>>>>>
>>>>>>>Yes it would be bad if the observed VF GPU clock will be
>>>>>>>variable since maxfreq is supposed to be static.
>>>>>>>
>>>>>>>So on VFs the reported GPU clock would move by the VF "used" quanta?
>>>>>>
>>>>>>s/used/available/. That's my understanding, yes. Each VF has a quanta
>>>>>>and the gpu clock moves according to that quanta. Note that
>>>>>>as I said,
>>>>>>this is not the case right now (we are just reading
>>>>>>RING_TIMESTAMP), but
>>>>>>the intention is to have the UAPI side ready so it's already prepared
>>>>>>for that.
>>>>>>
>>>>>>>Where "used" is defined as time given by the GuC, not
>>>>>>>necessarily used
>>>>>>
>>>>>>s/used/available/ as above
>>>>>>
>>>>>>>GPU time. For instance 16ms quanta, VF GPU clock would
>>>>>>>move by 16ms if the GuC decides not to switch out the idle
>>>>>>>VF? Or it could move by less than 16ms if it switched it
>>>>>>>out earlier.
>>>>>>
>>>>>>no, afaiu it's 16ms, not less. But the quanta depends on the number of
>>>>>>VFs enabled, which may change in runtime.
>>>>>>
>>>>>>I'm not 100% certain and people in Cc may correct me.
>>>>>>
>>>>>>>
>>>>>>>>What if we had
>>>>>>>>
>>>>>>>>drm-cycles-<keystr>: <uint>
>>>>>>>>drm-total-cycles-<keystr>: <uint>
>>>>>>>>
>>>>>>>>Then the utilization can be done:
>>>>>>>>
>>>>>>>> s1 = sample()
>>>>>>>> sleep(period)
>>>>>>>> s2 = sample()
>>>>>>>> engine_utilization = (s2.cycles - s1.cycles) / \
>>>>>>>> (s2.total_cycles - s1.total_cycles + 1);
>>>>>>>>
>>>>>>>>Capacity still to be added above, but we'd need to clarify if
>>>>>>>>drm-total-cycles-<keystr> already accounts for it.
>>>>>>>>
>>>>>>>>Here instead of the conversion to cpu clock, I'm expecting to read
>>>>>>>>"total_cycles" from HW and that being different (slower) for VF.
>>>>>>>>AFAICS this is not the case with this current polling implementation
>>>>>>>>since we are simply reading the RING_TIMESTAMP, but
>>>>>>>>there are planned
>>>>>>>>changes to get it from GuC. Umesh/Michal Cc'ed may know better.
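As a runnable sketch of the two-counter proposal above (key names follow the proposal; the sample numbers are invented), no CPU clock enters the calculation since both counters are in the same GPU tick domain:

```python
def busyness(s1, s2):
    """Unitless utilization from two fdinfo samples."""
    busy = s2["drm-cycles-render"] - s1["drm-cycles-render"]
    total = s2["drm-total-cycles-render"] - s1["drm-total-cycles-render"]
    return busy / (total + 1)  # +1 guards against a zero delta, as above

s1 = {"drm-cycles-render": 1_000, "drm-total-cycles-render": 10_000}
s2 = {"drm-cycles-render": 6_000, "drm-total-cycles-render": 20_000}
print(f"{busyness(s1, s2):.1%}")  # roughly 50%
```

On a VF, `drm-total-cycles` advancing only by the VF's quanta is what keeps the ratio meaningful without exposing absolute time.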
>>>>>>>
>>>>>>>I think this works and is clean.
>>>>>>>
>>>>>>>Although I have some doubts about the usefulness on VFs,
>>>>>>>if the clock movements are at the mercy of the GuC
>>>>>>>scheduler. Like what does 100% mean for a VF? Maybe it was
>>>>>>>full quanta, or maybe it was half a quanta if GuC decided
>>>>>>>to switch it out early, either due going idle or due some
>>>>>>>other scheduling decision.
>>>>>>
>>>>>>in the scenario you described above the quanta could change
>>>>>>according to
>>>>>>the scheduler and 100% wouldn't mean much. That's not my
>>>>>>understanding.
>>>>>>100% always means the VF used all the allocated time. I see this line
>>>>>>potentially getting blurred a little bit if the scheduler tries to
>>>>>>maximize the HW usage and distribute quanta unevenly, but I think the
>>>>>>interface already contemplates that.
>>>>>>
>>>>>>Another case is the VF not being able to reach 100% because the PF is
>>>>>>submitting high prio work. But I still think the current interface is
>>>>>>sufficient and it's the implementation by GuC/HW that could
>>>>>>be improved
>>>>>>(e.g. adapting the gpu time reported).
>>>>>>
>>>>>>Michal / Umesh, please chime in if that is not accurate.
>>>>>>
>>>>>
>>>>>Irrespective of how much quanta a VF used, all calculations
>>>>>will be based on the quanta that it was allocated. That way
>>>>>the VF would know that it could have better utilized the
>>>>>allotted time if busyness is less than 100. This does result
>>>>>in more than 100% usage for a VF that was resource hungry and
>>>>>scheduling policies allowed it to run more than the allotted
>>>>>quanta, but this is a known limitation of the solution
>>>>>provided by GuC. When looking at the overall system (say from
>>>>>a PF), the usage should still add up to 100%.
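Toy numbers for the limitation described above: busyness is computed against the *allocated* quanta, so a VF that the scheduler allowed to overrun can report more than 100%, while the PF-wide view still sums to 100%. The millisecond values are invented:

```python
allocated_quanta_ms = 16.0  # what the VF was allotted per window
actually_ran_ms = 20.0      # scheduler let the resource-hungry VF overrun

vf_busyness = actually_ran_ms / allocated_quanta_ms
print(f"{vf_busyness:.0%}")  # 125%
```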
>>>>>
>>>>>>
>>>>>>Thinking out loud: IFF the execution quanta is available for
>>>>>>VF to query
>>>>>
>>>>>For the VF, GuC intends to provide a factor that can be used
>>>>>to scale the wall time and deduce the VF quanta. This scaled
>>>>>value is used as the second counter in a VF.
>>>>>>and we are ok with just scaling drm-maxfreq, then maybe we could even
>>>>>>just use the current interface instead of adding a third one. Although
>>>>>>it could be confusing to have that freq changing.
>>>>>
>>>>>Assuming you are talking about using the drm-cycles and
>>>>>drm-max-freq. One of the concerns when supporting VFs was that
>>>>>we cannot actually export busyness in absolute time units to
>>>>>the user because the GPU is shared across VFs. If we scale the
>>>>>busyness such that it is stretched across CPU time, then it
>>>>>helps get the right busyness % relative to CPU time, but the
>>>>>value of busyness in time units itself is false. This was the
>>>>>primary reason to use 2 "unitless" counters.
>>>>>
>>>>>fwiu, I think by using the drm-maxfreq, you are going to bring
>>>>>the same concept back in the discussion - exporting busyness
>>>>>in time units. Not sure if that's a good idea. Let me know if
>>>>>I got that wrong.
>>>>
>>>>no, but I think it would still work if we can scale the freq
>>>>according to
>>>>the quanta. But that's probably abusing the interface.
>>>>
>>>>Anyway I think we are settling on
>>>>
>>>> drm-cycles-<engineclass>
>>>> drm-total-cycles-<engineclass>
>>>>
>>>>so I will start changing the patches and igt while checking this for
>>>>more feedback if any.
>>>Another option came to mind - expose a quanta ratio as a new key.
>>>Given it has no security implications and is to be used only for
>>>calculating real VF GPU utilisation. Like:
>>>
>>>drm-engine-*: <uint> ns
>>>drm-engine-time-ratio-*: <float>
>>>
>>>Unit would be a ratio of time over quanta, for instance 1000ms /
>>>100ms = 10. Which would mean scale down reported time by 10 when
>>>comparing against wall time.
>>>
>>>New key would only appear on VFs. Otherwise assumed 1.
>>>
>>>Or could be avoided per engine and just have single global:
>>>
>>>drm-gpu-time-ratio: <float>
>>>
>>>Am I missing something or could that work? It would have the same
>>>"could go over 100%" problem mentioned above. I mean
>>>this comment:
>>
>>I am hesitant to expose the quanta ratio at this level. We get that
>>from GuC and that interface could potentially change. If that
>>happens, I'd prefer that the uApi is unaffected.
>
>FWIW this idea was simply a time scale factor and I don't think it
>should have a connection to any GuC implementation details. In other
>words the observable result of time*scale/elapsed-time vs
>ticks/total-ticks should be the same. At least with the semantics that
>were discussed in this thread.
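To make the time*scale/elapsed-time semantics concrete: a VF granted 100ms of quanta per 1000ms of wall time would expose a ratio of 10, and a monitor multiplies observed engine time by the ratio before dividing by elapsed wall time. The delta and ratio values below are invented:

```python
engine_time_delta_ns = 80_000_000   # delta of drm-engine-render between samples
elapsed_wall_ns = 1_000_000_000     # sampling period
gpu_time_ratio = 10.0               # 1000ms wall / 100ms quanta

# time * scale / elapsed-time, equivalent to ticks / total-ticks
utilization = engine_time_delta_ns * gpu_time_ratio / elapsed_wall_ns
print(f"{utilization:.0%}")  # 80% of the VF's quanta was used
```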
>
For the drm client busyness that we are discussing here, the VF-specific
interface is still WIP. The quanta ratio is more of an engine busyness
concept (see below) that I presume will be used here as well.
>>>
>>>"""
>>>Irrespective of how much quanta a VF used, all calculations will be
>>>based on the quanta that it was allocated.
>>>...
>>>This does result in more than 100% usage for a VF that
>>>was resource hungry and scheduling policies allowed it to run more
>>>than the allotted quanta,
>>>"""
>>>
>>>I read that as total-ticks would never be reported as above the
>>>configured quanta - always equal.
>>
>>Correct
>>
>>>Maybe I misunderstood?
>>>
>>>Second topic - are there any plans to allow the PF to monitor VF
>>>GPU utilisation? That wouldn't work via fdinfo aggregation since
>>>VF clients will not be visible in a PF. But it sounds like a
>>>basic and important use case.
>>
>>wip. Engine utilization will be available per-VF from PF.
>
>How it will be exposed out of curiosity?
This would be supported only at the engine level granularity. Client
level, like you mention, is not possible.
Like here, the % engine busyness is exposed as 2 counters -
engine_runtime (how long the engine ran) and total_ticks (how long the
PF/VF ran). Internally GuC provides the engine_runtime in ticks and the KMD
derives total_ticks from the quanta_ratio and the elapsed CPU time.
That information is available to the PF as an array of functions that can
be indexed using the function index.
A global busyness is also available (aggregate of all functions).
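A hedged sketch of the total_ticks derivation described above; the 19.2MHz reference clock, the function name, and the 25% quanta share are assumptions for illustration, and the real computation lives in the KMD:

```python
REFCLOCK_HZ = 19_200_000  # assumed GPU timestamp frequency

def total_ticks(elapsed_cpu_ns, quanta_ratio):
    """Ticks the function could have run: wall time scaled by quanta share."""
    return elapsed_cpu_ns * quanta_ratio * REFCLOCK_HZ // 1_000_000_000

# A VF with a 25% quanta share, sampled over 1 second of wall time:
print(total_ticks(1_000_000_000, 0.25))  # 4800000.0 ticks available
```

Per-function busyness would then be engine_runtime / total_ticks, mirroring the drm-cycles / drm-total-cycles split at the client level.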
Regarding the KMD<->UMD interface:
1) The 2 counters are exported via the PMU interface.
2) The function index will be part of the config bitmap used with
perf_event_open.
I don't know yet if we will end up using the kernel perf interface for
this on xe (separate topic though).
Regards,
Umesh
>
>Regards,
>
>Tvrtko