[PATCH v3] drm/xe/pmu: Add GT frequency events

Mon Mar 24 19:07:32 UTC 2025

On Mon, Mar 24, 2025 at 09:24:47AM -0700, Ashutosh Dixit wrote:
>On Sat, 22 Mar 2025 19:47:04 -0700, Lucas De Marchi wrote:
>>
>> On Sat, Mar 22, 2025 at 02:06:09PM -0700, Ashutosh Dixit wrote:
>> > On Sat, 22 Mar 2025 05:57:12 -0700, Lucas De Marchi wrote:
>> >>
>> >> On Fri, Mar 21, 2025 at 03:45:21PM -0700, Belgaumkar, Vinay wrote:
>> >> >
>> >> > On 3/13/2025 2:45 PM, Lucas De Marchi wrote:
>> >> >> On Tue, Mar 11, 2025 at 05:14:08PM -0700, Vinay Belgaumkar wrote:
>> >> >>> Define PMU events for GT frequency (actual and requested). This is
>> >> >>> a port from the i915 driver implementation, where an internal timer
>> >> >>> is used to aggregate GT frequencies over certain fixed interval.
>> >> >>> Following PMU events are being added-
>> >> >>                      ^
>> >> >> why do you use "-"  instead of ":"?
>> >> >>
>> >> >>>
>> >> >>>  xe_0000_00_02.0/gt-actual-frequency/              [Kernel PMU event]
>> >> >>>  xe_0000_00_02.0/gt-requested-frequency/           [Kernel PMU event]
>> >> >>>
>> >> >>> Standard perf commands can be used to monitor GT frequency-
>> >> >>>  $ perf stat -e xe_0000_00_02.0/gt-requested-frequency,gt=0/ -I1000
>> >> >>>
>> >> >>>     1.001175175                700 M xe/gt-requested-frequency,gt=0/
>> >> >>>     2.005891881                703 M xe/gt-requested-frequency,gt=0/
>> >> >>>     3.007318169                700 M xe/gt-requested-frequency,gt=0/
>> >> >>>
>> >> >>> Actual frequencies will be reported as 0 when GT is in C6.
>> >> >>
>> >> >>
>> >> >> I think we need to document somewhere, but at the very least in the
>> >> >> commit message what the event count actually is. Let me see if I get
>> >> >> this right:  if userspace is sampling every 1sec, and assuming the gpu
>> >> >> is at 700MHz for the first 0.5sec and at 1.4 GHz, userspace should
>> >> >> expect to see ~1050 as the value. Correct?  I find this frequency
>> >> >> handling very different from anything else reported via perf. Other
>> >> >> than i915, are there any other cases you know of?
>> >> >
>> >> > Yes,  user will see a gradual ramp. One thing that could work is
>> >> > something along these lines:
>> >> >
>> >> > @@ -324,7 +327,10 @@ static void xe_pmu_event_update(struct perf_event
>> >> > *event)
>> >> >                 new = __xe_pmu_event_read(event);
>> >> >         } while (!local64_try_cmpxchg(&hwc->prev_count, &prev, new));
>> >> >
>> >> > -       local64_add(new - prev, &event->count);
>> >> > +       if (is_gt_frequency_event(id))
>> >> > +               local64_add(new, &event->count);
>> >> > +       else
>> >> > +               local64_add(new - prev, &event->count);
>> >> >
>> >> > This will give us instantaneous values and will not need the use of an
>> >> > internal timer.  Should be ok to do it this way?
>> >>
>> >> yes, I think it'd be preferred. It's much simpler and don't prevent us
>> >> from eventually adding an avg_* event if the needs arise. Which would
>> >> also be clearer on what that is.
>> >
>> > Instantaneous values would show 0 if the sampling instant landed when gt is
>>
>> instantaneous values make much more sense from the perf point of view
>> IMO. perf is **sampling** them - other than the same thing in i915, I've
>> never seen an event that returns an average (not even documenting it as
>> such).
>>
>> What would also make sense would be to support perf-record so the
>> samples are recorded in the buffer that is passed to userspace. In this
>> case setting up a timer to save them (since we don't have interrupt for
>> our pmu) would be nice.
>
>Afaik, typically userspace calls into the kernel for PMU counters
>relatively infrequently, say once a second (1 Hz). However, if this low
>rate does not yield good freq values (say freq is showing lots of 0's), now

why do you say it's not getting good value? is 0 not a good value?
Should the kernel be returning something else instead of doing an
average with "lots of zeros" and returning it?

>userspace will need to call into the kernel more frequently, say 10 Hz to
>100 Hz, just to get good freq values (all other counters are fine at 1

It should be perfectly fine for an application to setup a 10Hz timer and
getting it multiple times if it wants to average.... I don't think it
needs to be at the hundreds frequency.

>Hz). This increased rate of calling into the kernel may be ok for 'perf
>record' but not sure what happens to say 'gputop'.

oh no, perf-record works diferently... it's not calling into the
kernel with that frequency. It's perf-stat that works that way.

For perf-record it more like kernel pushing samples in a ringbuffer, shared
with userspace, and then waking it. We (currently) don't support that
mode in our implementation:

static int xe_pmu_event_init(struct perf_event *event)
{
	...
	/* unsupported modes and filters */
	if (event->attr.sample_period) /* no sampling */
		return -EINVAL;
	...
}

>
>I think considerations like this went into i915 deciding to average in the
>kernel.

as you noticed, maybe there was actually a lack of thought on the matter
and then later it couldn't be changed anymore... ? I saw several odd
things in the i915 side. Exposing something with a Hz unit is one of
them.

>
>Also, if we are going to expose instantaneous freq, if gt is in C6 (when
>userspace calls into the kernel to sample), could we save and return the
>freq /before/ gt went into C6 (to avoid returning 0 freq, and returning the
>last freq as it was when workload was running)? This might be sufficient
>and remove the need to sample more frequently, though not sure how
>complex to implement.

if that data would make more sense for userspace, then I think it should
be ok. So, let's say gt in in C6, what does make more sense for
gputop-like and turbostat?

1) read 0 from the kernel and report 0 to the user
2) read 0 from the kernel and report min_freq to the user (min_freq they
    can read once from from sysfs)
3) read min_freq from the kernel
4) read the last frequency before going into C6

Lucas De Marchi

>
>>
>> > in C6. Also instantaneous values are already available via sysfs.
>>
>> with the caveat of a slower interface, but yeah... that's why I was
>> asking earlier: do we really need this if it's already available via
>> sysfs? I was told turbostat wants to consolidate all the data to be read
>> via perf instead of reading a bunch of files from sysfs.
>
>OK.
>
>>
>> > Also, the way i915 calculates the average freq is itself
>> > controversial. i915 computes average freq when gt is awake and disregards
>> > those intervals when gt is in C6. There was this patch to suggest that
>> > average freq also take into account those intervals when gt is in C6:
>> >
>> > https://patchwork.freedesktop.org/series/109372/
>> >
>> > But it was not accepted since it would change uapi.
>>
>> one more reason not to fall in the same trap. Let userspace decide if it
>> wants the average or not. If it'd need to sample too frequently, then I
>> think we need to look at supporting perf-record rather than doing the
>> average (which is actually an integration, using the rectangle method,
>> and then dividing by the perf_read delta time). At the very least,
>> rename the event to make the avg explicit.
>
>OK, we can return instantaneous freq for now and if that doesn't work out,
>add an average freq later?
>
>>
>> Lucas De Marchi
>
>Thanks.
>--
>Ashutosh