[PATCH 0/7] drm/xe: Per client usage

Tvrtko Ursulin tursulin at ursulin.net
Mon Apr 22 10:40:33 UTC 2024


On 20/04/2024 00:51, Umesh Nerlige Ramappa wrote:
> On Fri, Apr 19, 2024 at 11:44:46AM +0100, Tvrtko Ursulin wrote:
>>
>> On 18/04/2024 00:19, Lucas De Marchi wrote:
>>> On Wed, Apr 17, 2024 at 01:35:29PM GMT, Umesh Nerlige Ramappa wrote:
>>>> On Wed, Apr 17, 2024 at 02:05:40PM -0500, Lucas De Marchi wrote:
>>>>> On Wed, Apr 17, 2024 at 09:51:42AM GMT, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 16/04/2024 19:29, Lucas De Marchi wrote:
>>>>>>> On Tue, Apr 16, 2024 at 03:22:21PM +0100, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 16/04/2024 14:51, Lucas De Marchi wrote:
>>>>>>>>> Forgot to Cc Michal, doing now.
>>>>>>>>>
>>>>>>>>> Lucas De Marchi
>>>>>>>>>
>>>>>>>>> On Tue, Apr 16, 2024 at 08:30:33AM -0500, Lucas De Marchi wrote:
>>>>>>>>>> On Tue, Apr 16, 2024 at 09:37:44AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2024 04:04, Lucas De Marchi wrote:
>>>>>>>>>>>> Add per-client usage statistics to xe. This ports xe to use 
>>>>>>>>>>>> the common method in drm to export the usage to userspace 
>>>>>>>>>>>> per client (where 1 client == 1 drm fd open).
>>>>>>>>>>>>
>>>>>>>>>>>> However, instead of using the current format, this creates a 
>>>>>>>>>>>> new one with the unit "ticks". The intention here is not to 
>>>>>>>>>>>> mix the GPU clock domain with the CPU clock. It allows 
>>>>>>>>>>>> covering a few more use cases without extra complications.
>>>>>>>>>>>>
>>>>>>>>>>>> Last patch was a quick implementation of a gputop-like tool 
>>>>>>>>>>>> in python. I ended up doing it to cross-check the gputop 
>>>>>>>>>>>> implementation. It's not really meant to be applied here.
>>>>>>>>>>>>
>>>>>>>>>>>> I tested this on DG2 and TGL with kmscube (console-only) and 
>>>>>>>>>>>> vkcube
>>>>>>>>>>>> (in a gnome session), but it would be good to soak this 
>>>>>>>>>>>> under more
>>>>>>>>>>>> tests. The biggest goal for this patch series right now is 
>>>>>>>>>>>> to get
>>>>>>>>>>>> consensus on the new UAPI.
>>>>>>>>>>>>
>>>>>>>>>>>> TODO: Add documentation on top with the new interface.
>>>>>>>>>>>
>>>>>>>>>>> Yeah a drm-usage-stats.rst patch would be nice to have in the 
>>>>>>>>>>> RFC so one does not have to look into the driver 
>>>>>>>>>>> implementation to discuss the proposed uapi.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless I understand the proposal is to add this:
>>>>>>>>>>>
>>>>>>>>>>> drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
>>>>>>>>>>
>>>>>>>>>> yes, the gputop patch was more explicit about this. Should have 
>>>>>>>>>> added it in the kernel patch series too.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That's two values per key. I guess "one key value pair per one 
>>>>>>>>>>> line of text" does not get strictly broken, and you propose a 
>>>>>>>>>>> heuristic in parsing to detect that the <RUNTIME> cannot be 
>>>>>>>>>>> misinterpreted as the unit?
>>>>>>>>>>
>>>>>>>>>> the current format is
>>>>>>>>>>
>>>>>>>>>>     drm-engine-<class>: <RUNTIME> ns
>>>>>>>>>>
>>>>>>>>>> the "ns" at the end should be parsed by userspace to know
>>>>>>>>>> what it is about.
>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Not sure it is a good idea though. If you instead added a new 
>>>>>>>>>>> key for the gpu time what would be the downside in your view? 
>>>>>>>>>>> Like:
>>>>>>>>>>>
>>>>>>>>>>> drm-engine-<class>: <uint> ticks
>>>>>>>>>>> drm-ticks-<class>: <uint>
>>>>>>>>>>>
>>>>>>>>>>> Or maybe even obfuscate/generalise as:
>>>>>>>>>>>
>>>>>>>>>>> drm-engine-<class>: <uint> gpu-time
>>>>>>>>>>> drm-gpu-time-<class>: <uint>
>>>>>>>>>>
>>>>>>>>>> I think both work, but I fail to see the advantage. This 
>>>>>>>>>> alternative is slightly heavier on the parsing side since you 
>>>>>>>>>> have to correlate the values from 2 keys, possibly dealing with 
>>>>>>>>>> them appearing in a different order. The only possible breakage 
>>>>>>>>>> remains the same with this alternative: userspace that didn't 
>>>>>>>>>> parse the unit before. I checked nvtop and htop and they were 
>>>>>>>>>> doing the right thing. I sent a fix to igt a few weeks back for 
>>>>>>>>>> it to consider the unit:
>>>>>>>>>> https://lore.kernel.org/igt-dev/20240405060056.59379-8-lucas.demarchi@intel.com/
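For reference, unit-aware parsing of these fdinfo lines boils down to
something like the following minimal Python sketch (illustrative only,
not the actual igt/nvtop/htop code; the fdinfo path is made up):

    def parse_fdinfo(path="/proc/self/fdinfo/4"):
        """Collect drm-* key/value pairs, keeping the optional unit suffix."""
        stats = {}
        with open(path) as f:
            for line in f:
                key, sep, value = line.partition(":")
                if not sep or not key.startswith("drm-"):
                    continue
                fields = value.split()          # e.g. ["12345", "ns"]
                if not fields:
                    continue
                unit = fields[1] if len(fields) > 1 else None
                amount = int(fields[0]) if fields[0].isdigit() else fields[0]
                stats[key.strip()] = (amount, unit)
        return stats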
>>>>>>>>
>>>>>>>> Advantages are that "drm-engine-something: 1234 5678 ticks" 
>>>>>>>> isn't self-explanatory (intuitively human-readable) and that it 
>>>>>>>> doesn't
>>>>>>>
>>>>>>> maybe I have a different expectation from procfs. When I do e.g.
>>>>>>>
>>>>>>> # cat /proc/self/stat
>>>>>>> 3861283 (cat) R 3861233 3861283 3861231 34816 3861283 4194304 90 
>>>>>>> 0 0 0 0 0 0 0 20 0 1 0 1321348797 8560640 384 
>>>>>>> 18446744073709551615 93979016876032 93979016892449 
>>>>>>> 140720658378704 0 0 0 0 0 0 0 0 0 17 51 0 0 0 0 0 93979016907440 
>>>>>>> 93979016908904 93979037196288 140720658380605 140720658380625 
>>>>>>> 140720658380625 140720658382827 0
>>>>>>>
>>>>>>> it doesn't seem to me "intuitively human-readable" was the first
>>>>>>> concern for people adding files in procfs :)... I'd rather think
>>>>>>> "machine readable" was more important.
>>>>>>
>>>>>> I think you are pushing the argument a bit now :) since IMO we 
>>>>>> should evaluate the drm-usage-stats.rst proposal more in the 
>>>>>> context of drm-usage-stats and other fdinfo files, rather than the 
>>>>>> whole of procfs. In other words, if there isn't a strong reason to 
>>>>>> regress this particular file, let's not do it.
>>>>>
>>>>> :) I like pushing arguments if it helps revisit decisions (human vs
>>>>> machine readable for things in procfs). I'm not trying to push the
>>>>> 2-counter approach though. I think other reasons like the ones
>>>>> discussed below are enough to consider the other keys.
>>>>>
>>>>> TBH I was reluctant at first to add a separate uapi rather than
>>>>> re-using drm-engine-, without realizing there was already a second
>>>>> one (not implemented in gputop).
>>>>>
>>>>> So AFAICS i915 and amdgpu use drm-engine-. msm and panfrost use
>>>>> drm-cycles + drm-maxfreq. And none of them seems suitable for xe.
>>>>>
>>>>>>
>>>>>>>> diverge from the one value per key plus unit format. The latter 
>>>>>>>> we would then document clearly.
>>>>>>>>
>>>>>>>> Different keys potentially appearing in different order does not 
>>>>>>>> matter since userspace already has to handle that.
>>>>>>>>
>>>>>>>>>>> Potentially could also add a key saying how much wall time is 
>>>>>>>>>>> one unit of GPU time.
>>>>>>>>>>
>>>>>>>>>> I wouldn't really add it as it may not make sense depending on 
>>>>>>>>>> the vendor and/or usage. Examples: the gpu time may be 
>>>>>>>>>> different for different engines depending on where they are 
>>>>>>>>>> located (tile/gt). The correlation with CPU time is different 
>>>>>>>>>> when running in VF mode, and may change at runtime depending on 
>>>>>>>>>> the number of VFs. +Michal.
>>>>>>>>
>>>>>>>> Yes, that's why I said "potentially", which was supposed to mean 
>>>>>>>> if and where it makes sense and perhaps adds value.
>>>>>>>>
>>>>>>>>>> Also, if the userspace side really wants to know (why would 
>>>>>>>>>> it?) it could just be calculated from 2 samples (possibly 
>>>>>>>>>> repeated a few times as it updates the output).
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Or... would even the existing drm-cycles, plus abuse of 
>>>>>>>>>>> drm-maxfreq, work? Ticks == cycles, maxfreq == ticks per wall 
>>>>>>>>>>> second.
>>>>>>>>>>
>>>>>>>>>> I think it'd be up to the gpu vendor what clock backs this 
>>>>>>>>>> time. For the current Intel cards, it's the refclock and it 
>>>>>>>>>> doesn't vary in frequency.
>>>>>>>>
>>>>>>>> Right, but that doesn't matter. What I was saying is that if you 
>>>>>>>> exposed ticks in drm-cycles and tick frequency in drm-maxfreq it 
>>>>>>>> would even work, no? (Assuming support for those two was actually 
>>>>>>>> in igt_drm_fdinfo/clients/gputop and could be used as a fallback 
>>>>>>>> to time-based stats.)
>>>>>>>
>>>>>>> oh... I was looking at the output for i915 and missed that we had
>>>>>>> drm-cycles as currently i915 doesn't use it. It seems to be a 
>>>>>>> similar
>>>>>>> thing. I agree the drm-maxfreq- is unfortunate and that we don't
>>>>>>> actually have support for that in gputop.
>>>>>>>
>>>>>>> So, instead of the 2 numbers + different unit, I can adapt this to
>>>>>>> rather use drm-cycles. However for maxfreq, it doesn't seem to be
>>>>>>> what we need since it doesn't scale for VF. It brings back the cpu
>>>>>>> clock domain this series is trying to avoid. The difference is that
>>>>>>> using drm-cycles- and drm-maxfreq- you are expecting userspace to do
>>>>>>> (let me know if I interpreted the docs wrong):
>>>>>>>
>>>>>>>     s1 = sample()
>>>>>>>     sleep(period)
>>>>>>>     s2 = sample()
>>>>>>>     engine_utilization = ((s2.drm_cycles / s2.drm_max_freq) -
>>>>>>>                  (s1.drm_cycles / s1.drm_max_freq)) / period
>>>>>>>
>>>>>>> ... considering the drm_max_freq may change from one call to the
>>>>>>> other. If we simplify it and assume it doesn't change:
>>>>>>>
>>>>>>>     engine_utilization = (s2.drm_cycles - s1.drm_cycles) /
>>>>>>>                  (drm_max_freq * period)
>>>>>>>
>>>>>>> we'd need a different drm_max_freq reported by the VF driver, which
>>>>>>> would need to know the number of VFs enabled to scale it correctly.
>>>>>>> Maybe this is abusing the "drm-maxfreq" a little bit?
>>>>>>
>>>>>> Yes, it would be bad if the observed VF GPU clock were variable, 
>>>>>> since maxfreq is supposed to be static.
>>>>>>
>>>>>> So on VFs the reported GPU clock would move by the VF's "used" 
>>>>>> quanta?
>>>>>
>>>>> s/used/available/. That's my understanding, yes. Each VF has a quanta
>>>>> and the gpu clock moves according to that quanta.  Note that as I 
>>>>> said,
>>>>> this is not the case right now (we are just reading 
>>>>> RING_TIMESTAMP), but
>>>>> the intention is to have the UAPI side ready so it's already prepared
>>>>> for that.
>>>>>
>>>>>> Where "used" is defined as time given by the GuC, not necessarily 
>>>>>> used
>>>>>
>>>>> s/used/available/ as above
>>>>>
>>>>>> GPU time. For instance with a 16ms quanta, the VF GPU clock would 
>>>>>> move by 16ms if the GuC decides not to switch out the idle VF? Or it 
>>>>>> could move by less than 16ms if it switched it out earlier.
>>>>>
>>>>> no, afaiu it's 16ms, not less. But the quanta depends on the number of
>>>>> VFs enabled, which may change at runtime.
>>>>>
>>>>> I'm not 100% certain and people in Cc may correct me.
>>>>>
>>>>>>
>>>>>>> What if we had
>>>>>>>
>>>>>>> drm-cycles-<keystr>: <uint>
>>>>>>> drm-total-cycles-<keystr>: <uint>
>>>>>>>
>>>>>>> Then the utilization can be computed as:
>>>>>>>
>>>>>>>     s1 = sample()
>>>>>>>     sleep(period)
>>>>>>>     s2 = sample()
>>>>>>>     engine_utilization = (s2.cycles - s1.cycles) /
>>>>>>>                  (s2.total_cycles - s1.total_cycles + 1)
>>>>>>>
>>>>>>> Capacity still to be added above, but we'd need to clarify if
>>>>>>> drm-total-cycles-<keystr> already accounts for it.
>>>>>>>
>>>>>>> Here, instead of the conversion to the cpu clock, I'm expecting to
>>>>>>> read "total_cycles" from HW, with that being different (slower) for
>>>>>>> a VF. AFAICS this is not the case with the current polling
>>>>>>> implementation since we are simply reading the RING_TIMESTAMP, but
>>>>>>> there are planned changes to get it from GuC. Umesh/Michal Cc'ed
>>>>>>> may know better.
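To make the proposed keys concrete, a minimal Python sketch of the
userspace side (assuming drm-cycles-<class> and drm-total-cycles-<class>
land as proposed; the fdinfo path and engine class name are made up):

    import time

    def read_counters(fdinfo_path, engine):
        """Return (client ticks, total ticks) for one engine class."""
        cycles = total = 0
        with open(fdinfo_path) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key == "drm-cycles-" + engine:
                    cycles = int(value.split()[0])
                elif key == "drm-total-cycles-" + engine:
                    total = int(value.split()[0])
        return cycles, total

    def engine_utilization(fdinfo_path, engine, period=1.0):
        c1, t1 = read_counters(fdinfo_path, engine)
        time.sleep(period)
        c2, t2 = read_counters(fdinfo_path, engine)
        # "+ 1" mirrors the formula above and avoids a division by zero
        # when the total counter did not move between the two samples.
        return (c2 - c1) / (t2 - t1 + 1)

    # e.g. engine_utilization("/proc/1234/fdinfo/4", "rcs", period=2.0)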
>>>>>>
>>>>>> I think this works and is clean.
>>>>>>
>>>>>> Although I have some doubts about the usefulness on VFs, if the 
>>>>>> clock movements are at the mercy of the GuC scheduler. Like what 
>>>>>> does 100% mean for a VF? Maybe it was a full quanta, or maybe it 
>>>>>> was half a quanta if GuC decided to switch it out early, either due 
>>>>>> to going idle or due to some other scheduling decision.
>>>>>
>>>>> in the scenario you described above, the quanta could change 
>>>>> according to the scheduler and 100% wouldn't mean much. That's not 
>>>>> my understanding: 100% always means the VF used all the allocated 
>>>>> time. I see this line potentially getting blurred a little bit if 
>>>>> the scheduler tries to maximize the HW usage and distribute quanta 
>>>>> unevenly, but I think the interface already contemplates that.
>>>>>
>>>>> Another case is the VF not being able to reach 100% because the PF is
>>>>> submitting high prio work. But I still think the current interface is
>>>>> sufficient and it's the implementation by GuC/HW that could be 
>>>>> improved
>>>>> (e.g. adapting the gpu time reported).
>>>>>
>>>>> Michal / Umesh, please chime in if that is not accurate.
>>>>>
>>>>
>>>> Irrespective of how much quanta a VF used, all calculations will be 
>>>> based on the quanta that it was allocated. That way the VF would 
>>>> know that it could have better utilized the allotted time if 
>>>> busyness is less than 100. This does result in more than 100% usage 
>>>> for a VF that was resource hungry and that scheduling policies 
>>>> allowed to run more than the allotted quanta, but this is a known 
>>>> limitation of the solution provided by GuC. When looking at the 
>>>> overall system (say from a PF), the usage should still add up to 100%.
>>>>
>>>>>
>>>>> Thinking out loud: IFF the execution quanta is available for the VF
>>>>> to query
>>>>
>>>> For the VF, GuC intends to provide a factor that can be used to 
>>>> scale the wall time and deduce the VF quanta. This scaled value is 
>>>> used as the second counter in a VF.
>>>>> and we are ok with just scaling drm-maxfreq, then maybe we could even
>>>>> just use the current interface instead of adding a third one. Although
>>>>> it could be confusing to have that freq changing.
>>>>
>>>> Assuming you are talking about using drm-cycles and 
>>>> drm-maxfreq: one of the concerns when supporting VFs was that we 
>>>> cannot actually export busyness in absolute time units to the user 
>>>> because the GPU is shared across VFs. If we scale the busyness such 
>>>> that it is stretched across CPU time, then it helps get the right 
>>>> busyness % relative to CPU time, but the value of busyness in time 
>>>> units itself is false. This was the primary reason to use 2 
>>>> "unitless" counters.
>>>>
>>>> fwiu, I think by using drm-maxfreq you are going to bring the 
>>>> same concept back into the discussion - exporting busyness in time 
>>>> units. Not sure if that's a good idea. Let me know if I got that wrong.
>>>
>>> no, but I think it would still work if we can scale the freq 
>>> according to
>>> the quanta.  But that's probably abusing the interface.
>>>
>>> Anyway I think we are settling on
>>>
>>>     drm-cycles-<engineclass>
>>>     drm-total-cycles-<engineclass>
>>>
>>> so I will start changing the patches and igt while checking this 
>>> thread for more feedback, if any.
>> Another option came to mind - expose a quanta ratio as a new key, 
>> given it has no security implications and is to be used only for 
>> calculating real VF GPU utilisation. Like:
>>
>> drm-engine-*: <uint> ns
>> drm-engine-time-ratio-*: <float>
>>
>> Unit would be a ratio of wall time over quanta, for instance 1000ms / 
>> 100ms = 10, which would mean scaling the reported time by 10 when 
>> comparing against wall time.
>>
>> New key would only appear on VFs. Otherwise assumed 1.
>>
>> Or could be avoided per engine and just have single global:
>>
>> drm-gpu-time-ratio: <float>
>>
>> Am I missing something, or could that work? It would have the same 
>> problem as mentioned above - "could go over 100%" being one. I mean 
>> this comment:
> 
> I am hesitant to expose the quanta ratio at this level. We get that from 
> GuC and that interface could potentially change. If that happens, I'd 
> prefer that the uApi is unaffected.

FWIW this idea was simply a time scale factor and I don't think it 
should have a connection to any GuC implementation details. In other 
words, the observable result of time*scale/elapsed-time vs 
ticks/total-ticks should be the same, at least with the semantics that 
were discussed in this thread.
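A quick numeric sanity check of that equivalence, with made-up values for
a VF given a 100ms quanta per 1000ms of wall time (so the scale factor is
10), assuming the semantics discussed above:

    elapsed_ms = 1000.0
    scale = 10.0        # wall time / quanta, the proposed ratio key
    busy_ms = 80.0      # delta of drm-engine-* over the window, in ms

    ticks = 80          # delta of drm-cycles-* over the same window
    total_ticks = 100   # delta of drm-total-cycles-* (the quanta in ticks)

    util_from_ratio = busy_ms * scale / elapsed_ms   # 0.8
    util_from_ticks = ticks / total_ticks            # 0.8
    assert abs(util_from_ratio - util_from_ticks) < 1e-9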

>>
>> """
>> Irrespective of how much quanta a VF used, all calculations will be
>> based on the quanta that it was allocated.
>> ...
>> This does result in more than 100% usage for a VF that
>> was resource hungry and scheduling policies allowed it to run more
>> than the allotted quanta,
>> """
>>
>> I read that as total-ticks would never be reported as above the 
>> configured quanta - always equal.
> 
> Correct
> 
>> Maybe I misunderstood?
>>
>> Second topic - are there any plans to allow the PF to monitor VF GPU 
>> utilisation? That wouldn't work via fdinfo aggregation since VF 
>> clients will not be visible in a PF. But it sounds like a basic and 
>> important use case.
> 
> wip. Engine utilization will be available per-VF from the PF.

How it will be exposed out of curiosity?

Regards,

Tvrtko

