[PATCH 0/7] drm/xe: Per client usage

Lucas De Marchi lucas.demarchi at intel.com
Tue Apr 16 13:30:33 UTC 2024


On Tue, Apr 16, 2024 at 09:37:44AM +0100, Tvrtko Ursulin wrote:
>
>On 16/04/2024 04:04, Lucas De Marchi wrote:
>>Add per-client usage statistics to xe. This ports xe to use the common
>>method in drm to export the usage to userspace per client (where 1
>>client == 1 drm fd open).
>>
>>However, instead of using the current format, this creates a new one with
>>the unit "ticks". The intention here is not to mix the GPU clock domain
>>with the CPU clock. It allows covering a few more use cases without
>>extra complications.
>>
>>Last patch was a quick implementation of a gputop-like tool in python.
>>I ended up doing it to cross-check the gputop implementation. It's not
>>really meant to be applied here.
>>
>>I tested this on DG2 and TGL with kmscube (console-only) and vkcube
>>(in a gnome session), but it would be good to soak this under more
>>tests. The biggest goal for this patch series right now is to get
>>consensus on the new UAPI.
>>
>>TODO: Add documentation on top with the new interface.
>
>Yeah a drm-usage-stats.rst patch would be nice to have in the RFC so 
>one does not have to look into the driver implementation to discuss 
>the proposed uapi.
>
>Nevertheless I understand the proposal is to add this:
>
> drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks

yes, the gputop patch was more explicit about this. Should have added it
in the kernel patch series too.

>
>That's two values per key. I guess "one key value pair per one 
>line of text" does not get strictly broken, and that you propose a 
>heuristic in parsing to detect that the <RUNTIME> cannot be 
>mis-interpreted as the unit?

the current format is

	drm-engine-<class>: <RUNTIME> ns

the "ns" in the end should be parsed by userspace to know
what it is about.
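
For illustration, a parser keyed on that unit suffix could handle both
the current and the proposed format with something like this (an
untested python sketch; none of these names come from the series):

	def parse_engine_line(line):
	    # "drm-engine-<class>: <value> [<value>] <unit>"
	    key, _, rest = line.partition(":")
	    fields = rest.split()
	    if not fields:
	        return None
	    if fields[-1] == "ns":
	        # current format: drm-engine-<class>: <RUNTIME> ns
	        return {"runtime_ns": int(fields[0])}
	    if fields[-1] == "ticks":
	        # proposed format: drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
	        return {"timestamp": int(fields[0]), "runtime": int(fields[1])}
	    return None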

>
>Not sure it is a good idea though. If you instead added a new key for 
>the gpu time what would be the downside in your view? Like:
>
>drm-engine-<class>: <uint> ticks
>drm-ticks-<class>: <uint>
>
>Or maybe even obsfuscate/generalise as:
>
>drm-engine-<class>: <uint> gpu-time
>drm-gpu-time-<class>: <uint>

I think both work, but I fail to see the advantage. This alternative is
slightly heavier on the parsing side since you have to correlate the
values from 2 keys, possibly dealing with them appearing in a different
order. And the only possible breakage remains with this alternative:
userspace that didn't parse the unit before. I checked nvtop and htop
and they were doing the right thing. I sent a fix to igt a few weeks
back for it to consider the unit:
https://lore.kernel.org/igt-dev/20240405060056.59379-8-lucas.demarchi@intel.com/
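
To make the comparison concrete, the 2-key variant would need the
parser to accumulate and join the values per class, e.g. (hypothetical
sketch, key names taken from the alternative above):

	def parse_fdinfo(lines):
	    engines = {}
	    for line in lines:
	        key, _, rest = line.partition(":")
	        if key.startswith("drm-engine-"):
	            engines.setdefault(key[len("drm-engine-"):], {})["runtime"] = int(rest.split()[0])
	        elif key.startswith("drm-ticks-"):
	            engines.setdefault(key[len("drm-ticks-"):], {})["timestamp"] = int(rest.split()[0])
	    return engines

With a single key both values arrive together and no join is needed.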

>
>Potentially could also add a key saying how much wall time is one unit 
>of GPU time.

I wouldn't really add it, as it may not make sense depending on the
vendor and/or usage. Examples: the gpu time may be different for
different engines depending on where they are located (tile/gt). The
correlation with CPU time is different when running in VF mode, and may
change at runtime depending on the number of VFs. +Michal.

Also, if the userspace side really wants to know (why would it?),
it could just be calculated from 2 samples (possibly repeated a few
times as it updates the output).
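
If it's really needed, something along these lines would do (python
sketch; read_gpu_timestamp() is a hypothetical helper returning the
<GPU_TIMESTAMP> value for one engine class):

	import time

	def estimate_tick_rate(read_gpu_timestamp):
	    # sample the gpu timestamp twice against the cpu clock
	    t0, ts0 = time.monotonic(), read_gpu_timestamp()
	    time.sleep(1.0)
	    t1, ts1 = time.monotonic(), read_gpu_timestamp()
	    return (ts1 - ts0) / (t1 - t0)  # ticks per wall-clock second

Note that plain busyness doesn't need this at all:
delta(<RUNTIME>) / delta(<GPU_TIMESTAMP>) is a ratio within the same
clock domain.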

>
>Or.. would even the existing drm-cycles, plus abuse of drm-maxfreq, 
>work? Ticks == cycles, maxfreq == ticks per wall second.

I think it'd be up to the gpu vendor what clock backs this time. For
the current Intel cards, it's the refclock, and its frequency doesn't
vary.

>
>Secondly, wrap behaviour every 25-30 seconds patch 6/7 describes 
>definitely breaks the format spec and in my view should be worked 
>around in the driver:
>
>"""
>Values are not required to be constantly monotonic if it makes the driver
>implementation easier, but are required to catch up with the 
>previously reported
>larger value within a reasonable period. Upon observing a value lower 
>than what
>was previously read, userspace is expected to stay with that larger previous
>value until a monotonic update is seen.
>"""

but this is the behavior for dealing with "ns", not with "ticks". We
could add a worker to run every so often. When I got the original
patches from Umesh, we had this worker implemented, but dealing with it
was complex and wasted a lot of cycles, since it was always running
(except when we didn't have any exec_queue). This is my attempt to
simplify it, since I don't really see the need to account for a ~25s
wrap around.
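
For reference, a sampler polling more often than the wrap period can
extend the counter itself; assuming a 32-bit counter (an assumption
for illustration, not something stated in the series):

	WRAP = 1 << 32  # assumed counter width

	def extend(prev_raw, prev_ext, raw):
	    # the modular delta absorbs a single wrap between samples, so
	    # this is correct as long as we sample once per wrap period
	    delta = (raw - prev_raw) % WRAP
	    return prev_ext + delta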

thanks
Lucas De Marchi

>
>Regards,
>
>Tvrtko
>
>>
>>Test-with: https://lore.kernel.org/igt-dev/20240405060056.59379-1-lucas.demarchi@intel.com/
>>
>>Lucas De Marchi (5):
>>   drm/xe: Promote xe_hw_engine_class_to_str()
>>   drm/xe: Add XE_ENGINE_CLASS_OTHER to str conversion
>>   drm/xe: Add helper to capture engine timestamp
>>   drm/xe/client: Print runtime to fdinfo
>>   HACK: simple gputop-like impl in python
>>
>>Umesh Nerlige Ramappa (2):
>>   drm/xe/lrc: Add helper to capture context timestamp
>>   drm/xe: Add helper to capture context runtime
>>
>>  drivers/gpu/drm/xe/regs/xe_lrc_layout.h       |   1 +
>>  drivers/gpu/drm/xe/xe_device_types.h          |   9 ++
>>  drivers/gpu/drm/xe/xe_drm_client.c            |  81 ++++++++++++-
>>  drivers/gpu/drm/xe/xe_exec_queue.c            |  37 ++++++
>>  drivers/gpu/drm/xe/xe_exec_queue.h            |   1 +
>>  drivers/gpu/drm/xe/xe_hw_engine.c             |  29 +++++
>>  drivers/gpu/drm/xe/xe_hw_engine.h             |   4 +
>>  drivers/gpu/drm/xe/xe_hw_engine_class_sysfs.c |  18 ---
>>  drivers/gpu/drm/xe/xe_lrc.c                   |  11 ++
>>  drivers/gpu/drm/xe/xe_lrc.h                   |   2 +
>>  drivers/gpu/drm/xe/xe_lrc_types.h             |   3 +
>>  drivers/gpu/drm/xe/xe_sched_job.c             |   2 +
>>  pyfdinfo                                      | 113 ++++++++++++++++++
>>  13 files changed, 292 insertions(+), 19 deletions(-)
>>  create mode 100755 pyfdinfo
>>

