[PATCH 1/4] drm/xe/pmu: Enable PMU interface

Wed Nov 27 20:58:16 UTC 2024

On Wed, Nov 27, 2024 at 01:11:14PM -0600, Lucas De Marchi wrote:
>On Wed, Nov 27, 2024 at 10:47:28AM -0800, Umesh Nerlige Ramappa wrote:
>>On Wed, Nov 27, 2024 at 11:13:24AM -0600, Lucas De Marchi wrote:
>>>On Wed, Nov 27, 2024 at 12:42:14PM +0530, Riana Tauro wrote:
>>>>Hi Lucas
>>>>
>>>>On 11/26/2024 12:37 AM, Umesh Nerlige Ramappa wrote:
>>>>>On Mon, Sep 02, 2024 at 10:01:11PM -0500, Lucas De Marchi wrote:
>>>>>>On Mon, Sep 02, 2024 at 11:29:05AM GMT, Aravind Iddamsetty wrote:
>>>>>>>
>>>>>>>On 30/08/24 03:40, Belgaumkar, Vinay wrote:
>>>>>>>>
>>>>>>>>On 8/28/2024 12:33 PM, Lucas De Marchi wrote:
>>>>>>>>>On Tue, Aug 27, 2024 at 09:41:04AM GMT, Vinay Belgaumkar wrote:
>>>>>>>>>>diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>>>>>>>>index b6fbe4988f2e..de6f39db618c 100644
>>>>>>>>>>--- a/include/uapi/drm/xe_drm.h
>>>>>>>>>>+++ b/include/uapi/drm/xe_drm.h
>>>>>>>>>>@@ -1389,6 +1389,40 @@ struct drm_xe_wait_user_fence {
>>>>>>>>>>    __u64 reserved[2];
>>>>>>>>>>};
>>>>>>>>>>
>>>>>>>>>>+/**
>>>>>>>>>>+ * DOC: XE PMU event config IDs
>>>>>>>>>>+ *
>>>>>>>>>>+ * Check 'man perf_event_open' to use the ID's 
>>>>>>>>>>XE_PMU_XXXX listed in xe_drm.h
>>>>>>>>>>+ * in 'struct perf_event_attr' as part of 
>>>>>>>>>>perf_event_open syscall to read a
>>>>>>>>>>+ * particular event.
>>>>>>>>>>+ *
>>>>>>>>>>+ * For example to open the XE_PMU_RENDER_GROUP_BUSY(0):
>>>>>>>>>>+ *
>>>>>>>>>>+ * .. code-block:: C
>>>>>>>>>>+ *
>>>>>>>>>>+ *    struct perf_event_attr attr;
>>>>>>>>>>+ *    long long count;
>>>>>>>>>>+ *    int cpu = 0;
>>>>>>>>>>+ *    int fd;
>>>>>>>>>>+ *
>>>>>>>>>>+ *    memset(&attr, 0, sizeof(struct perf_event_attr));
>>>>>>>>>>+ *    attr.type = type; // eg: 
>>>>>>>>>>/sys/bus/event_source/devices/ xe_0000_56_00.0/type
>>>>>>>>>>+ *    attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED;
>>>>>>>>>>+ *    attr.use_clockid = 1;
>>>>>>>>>>+ *    attr.clockid = CLOCK_MONOTONIC;
>>>>>>>>>>+ *    attr.config = XE_PMU_RENDER_GROUP_BUSY(0);
>>>>>>>>>>+ *
>>>>>>>>>>+ *    fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
>>>>>>>>>>+ */
>>>>>>>>>>+
>>>>>>>>>>+/*
>>>>>>>>>>+ * Top bits of every counter are GT id.
>>>>>>>>>>+ */
>>>>>>>>>>+#define __XE_PMU_GT_SHIFT (56)
>>>>>>>>>>+
>>>>>>>>>>+#define ___XE_PMU_OTHER(gt, x) \
>>>>>>>>>>+    (((__u64)(x)) | ((__u64)(gt) << __XE_PMU_GT_SHIFT))
>>>>>>>>>>+
>>>>>>>>>
>>>>>>>>>The perf uapi is self-describing and users should look 
>>>>>>>>>up on sysfs what
>>>>>>>>>to use. Example for i915 since it's what I'm currently working on:
>>>>>>>>>
>>>>>>>>>    $ cat /sys/bus/event_source/devices/i915/events/actual-frequency
>>>>>>>>>    config=0x100000
>>>>>>>>>    $ cat 
>>>>>>>>>/sys/bus/event_source/devices/i915/events/actual- 
>>>>>>>>>frequency.unit
>>>>>>>>>    M
>>>>>>>>>
>>>>>>>>>`perf list` works fine and doesn't know anything about this xe-only
>>>>>>>>>header. Why would we add anything here rather than 
>>>>>>>>>encourage other users
>>>>>>>>>to read from the generic interface?
>>>>>>>>
>>>>>>>>Agree. perf list | grep rc6 is sufficient.
>>>>>>>
>>>>>>>This was previously asked by Rodrigo https:// 
>>>>>>>patchwork.freedesktop.org/patch/555013/?series=119504&rev=5
>>>>>>>
>>>>>>>so what changed from then to now.
>>>>>>
>>>>>>2 different things. That rev you are pointing out had:
>>>>>>
>>>>>>include/uapi/drm/xe_drm.h            |  16 +
>>>>>>
>>>>>>so the uapi was being changed without proper doc. Rodrigo asked for the
>>>>>>documentation to be added, because that is the proper way to add things
>>>>>>to the uapi header.
>>>>>>
>>>>>>In this review what I noticed though is that the change shouldn't be
>>>>>>there in the first place. We shouldn't change anything in the drm/xe
>>>>>>UAPI header because this is actually an UAPI (via sysfs) coming from
>>>>>>perf.
>>>>>
>>>>>@Lucas,
>>>>>
>>>>>Agree. This is carried over from i915, where other constraints 
>>>>>may have led to adding some header bits for gt.
>>>>>
>>>>>I want to mention that now there are 2 events per engine - 
>>>>>ticks (actual engine run ticks) and total_ticks (total ticks 
>>>>>that the engine was alloted) which enable us to take a ratio 
>>>>>of deltas to calculate utilization at the engine level. If we 
>>>>>add the config for each possible combination of gt and vf_id, 
>>>>>we would have
>>>>>
>>>>>num_engines * 2 * num_gts * num_vfs
>>>
>>>not sure what you're trying to do here with that * 2...?
>>>
>>>what are you trying to read from the userspace side?  2 separate
>>>counters or just 1 with the embedded info?
>>
>>2 separate counters per engine. Both are required to calculate % 
>>busyness. This is similar to PCEU where we read 2 counters and then 
>>determine the % utilization.
>
>so you don't need to embed the 2 in the config, since they are 2
>separate counters, each having its own set of config bits.
>
>>
>>>
>>>Also, why is this considering num_vfs? Is this to monitor from the host
>>>all the VFs being used?
>>
>>Yes. Currently VF specific utilization can only be accessed from host.
>>
>>>If so, then we need to be careful what we are
>>>trying to support: there's no dynamic event creation in perf - we'd
>>>need to unregister and register the pmu again like it's done in the
>>>uncore pmu
>>
>>Yes, initially I was suggesting that only supported events should 
>>show up in sysfs, but later Riana pointed out that we cannot do it, 
>>so I was hoping we could just register events for max_vfs (which is 
>>known at driver load time). Although, that's not a clean way to do 
>>it.
>>
>>Ideally VF KMD should be responsible for it's own instance of PMU 
>>(like below), but reading the counters from VF is not supported yet,
>>
>>/sys/bus/event_source/devices/<host_device>/events
>>/sys/bus/event_source/devices/<vf1>/events
>>/sys/bus/event_source/devices/<vf2>/events
>
>this would make it not accessible for PF for real use cases (the device
>being used inside a VM), unless a) I'm missing something, or b) by <vf1>
>you mean the PF driver is registering that (i.e. it's visible on the
>host side only)

Right, I was just mentioning that VF is responsible for it's own 
counters, but we want to read those counters from a PF and we cannot use 
the above.

>
>>...
>>
>>What we have here is a special case where VF events are being 
>>accessed from host, so I am unclear on what the best interface would 
>>be.
>>
>>Maybe we just create new PMU instances in host for each provisioned 
>>VF and then list those events under host like below:
>>
>>/sys/bus/event_source/devices/<host_device>/events/vf1
>>/sys/bus/event_source/devices/<host_device>/events/vf2
>>...
>>
>>These sysfs directories will get created dynamically, but will be 
>>associated with separate PMU instance.
>
>these are all *host* events that the *PF* instance creates. I think you
>can't have it as directories like above and we should probably support
>it in the same way we support multiple pcie devices connected, i.e. by
>hand crafting the name of the device... so it would actually be:
>
>/sys/bus/event_source/devices/xe_<host_device>_pf/events/*
>/sys/bus/event_source/devices/xe_<host_device>_vf1/events/*
>/sys/bus/event_source/devices/xe_<host_device>_vf2/events/*

ok, looks like the name we register for each PMU will be placed in 
/sys/bus/event_source/devices/, so that's still okay.

>
>etc.... The biggest problem is that we can't really unregister them on
>vf teardown until the perf_pmu_unregister() is fixed.

ok, agree. I guess we should just wait for that fix to land before any 
of this can be merged then.

Thanks,
Umesh

>
>Lucas De Marchi
>
>>
>>Thoughts?
>>
>>Thanks,
>>Umesh
>>
>>>
>>>>>
>>>>>where num_vfs should (ideally) be the number of vfs provisioned.
>>>>
>>>>Here it will be max_vfs instead of provisioned num_vfs. Pmu 
>>>>event attributes are added on pmu_register.
>>>>
>>>>num_engines * 2 * num_gts * max_vfs
>>>>
>>>>Did not find a way to add/remove events dynamically.
>>>
>>>there isn't. other PMUs unregister and register it again... but that is
>>>only done on early probe, not something really dynamic. If we want to
>>>support something dynamic like that then we really need to wait the
>>>perf_pmu_unregister() to be fixed: not really possible without that.
>>>
>>>
>>>Lucas De Marchi
>>>
>>>>
>>>>Thanks
>>>>Riana
>>>>
>>>>>
>>>>>Do you think that's fine?
>>>
>>>you have config, config1 and config2, each with 64bits
>>>
>>>I'm not really sure what that *2 mean and we'd better have some room
>>>to grow it for ABI compatibility (even if a proper tool would read the
>>>sysfs to check what is the meaning of each bit).
>>>
>>>Lucas De Marchi
>>>
>>>>>
>>>>>Thanks,
>>>>>Umesh
>>>>>
>>>>>>
>>>>>>Lucas De Marchi
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Aravind.
>>>>>>>
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>
>>>>>>>>Vinay.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>Lucas De Marchi
>>>>