[PATCH v3] drm/xe/pmu: Add GT frequency events

Dixit, Ashutosh ashutosh.dixit at intel.com
Mon Mar 24 16:24:47 UTC 2025


On Sat, 22 Mar 2025 19:47:04 -0700, Lucas De Marchi wrote:
>
> On Sat, Mar 22, 2025 at 02:06:09PM -0700, Ashutosh Dixit wrote:
> > On Sat, 22 Mar 2025 05:57:12 -0700, Lucas De Marchi wrote:
> >>
> >> On Fri, Mar 21, 2025 at 03:45:21PM -0700, Belgaumkar, Vinay wrote:
> >> >
> >> > On 3/13/2025 2:45 PM, Lucas De Marchi wrote:
> >> >> On Tue, Mar 11, 2025 at 05:14:08PM -0700, Vinay Belgaumkar wrote:
> >> >>> Define PMU events for GT frequency (actual and requested). This is
> >> >>> a port from the i915 driver implementation, where an internal timer
> >> >>> is used to aggregate GT frequencies over certain fixed interval.
> >> >>> Following PMU events are being added-
> >> >>                      ^
> >> >> why do you use "-"  instead of ":"?
> >> >>
> >> >>>
> >> >>>  xe_0000_00_02.0/gt-actual-frequency/              [Kernel PMU event]
> >> >>>  xe_0000_00_02.0/gt-requested-frequency/           [Kernel PMU event]
> >> >>>
> >> >>> Standard perf commands can be used to monitor GT frequency-
> >> >>>  $ perf stat -e xe_0000_00_02.0/gt-requested-frequency,gt=0/ -I1000
> >> >>>
> >> >>>     1.001175175                700 M xe/gt-requested-frequency,gt=0/
> >> >>>     2.005891881                703 M xe/gt-requested-frequency,gt=0/
> >> >>>     3.007318169                700 M xe/gt-requested-frequency,gt=0/
> >> >>>
> >> >>> Actual frequencies will be reported as 0 when GT is in C6.
> >> >>
> >> >>
> >> >> I think we need to document somewhere, but at the very least in the
> >> >> commit message what the event count actually is. Let me see if I get
> >> >> this right:  if userspace is sampling every 1sec, and assuming the gpu
> >> >> is at 700MHz for the first 0.5sec and at 1.4 GHz, userspace should
> >> >> expect to see ~1050 as the value. Correct?  I find this frequency
> >> >> handling very different from anything else reported via perf. Other
> >> >> than i915, are there any other cases you know of?
> >> >
> >> > Yes,  user will see a gradual ramp. One thing that could work is
> >> > something along these lines:
> >> >
> >> > @@ -324,7 +327,10 @@ static void xe_pmu_event_update(struct perf_event
> >> > *event)
> >> >                 new = __xe_pmu_event_read(event);
> >> >         } while (!local64_try_cmpxchg(&hwc->prev_count, &prev, new));
> >> >
> >> > -       local64_add(new - prev, &event->count);
> >> > +       if (is_gt_frequency_event(id))
> >> > +               local64_add(new, &event->count);
> >> > +       else
> >> > +               local64_add(new - prev, &event->count);
> >> >
> >> > This will give us instantaneous values and will not need the use of an
> >> > internal timer.  Should be ok to do it this way?
> >>
> >> yes, I think it'd be preferred. It's much simpler and don't prevent us
> >> from eventually adding an avg_* event if the needs arise. Which would
> >> also be clearer on what that is.
> >
> > Instantaneous values would show 0 if the sampling instant landed when gt is
>
> instantaneous values make much more sense from the perf point of view
> IMO. perf is **sampling** them - other than the same thing in i915, I've
> never seen an event that returns an average (not even documenting it as
> such).
>
> What would also make sense would be to support perf-record so the
> samples are recorded in the buffer that is passed to userspace. In this
> case setting up a timer to save them (since we don't have interrupt for
> our pmu) would be nice.

Afaik, typically userspace calls into the kernel for PMU counters
relatively infrequently, say once a second (1 Hz). However, if this low
rate does not yield good freq values (say freq is showing lots of 0's), now
userspace will need to call into the kernel more frequently, say 10 Hz to
100 Hz, just to get good freq values (all other counters are fine at 1
Hz). This increased rate of calling into the kernel may be ok for 'perf
record' but not sure what happens to say 'gputop'.

I think considerations like this went into i915 deciding to average in the
kernel.

Also, if we are going to expose instantaneous freq, if gt is in C6 (when
userspace calls into the kernel to sample), could we save and return the
freq /before/ gt went into C6 (to avoid returning 0 freq, and returning the
last freq as it was when workload was running)? This might be sufficient
and remove the need to sample more frequently, though not sure how
complex to implement.

>
> > in C6. Also instantaneous values are already available via sysfs.
>
> with the caveat of a slower interface, but yeah... that's why I was
> asking earlier: do we really need this if it's already available via
> sysfs? I was told turbostat wants to consolidate all the data to be read
> via perf instead of reading a bunch of files from sysfs.

OK.

>
> > Also, the way i915 calculates the average freq is itself
> > controversial. i915 computes average freq when gt is awake and disregards
> > those intervals when gt is in C6. There was this patch to suggest that
> > average freq also take into account those intervals when gt is in C6:
> >
> > https://patchwork.freedesktop.org/series/109372/
> >
> > But it was not accepted since it would change uapi.
>
> one more reason not to fall in the same trap. Let userspace decide if it
> wants the average or not. If it'd need to sample too frequently, then I
> think we need to look at supporting perf-record rather than doing the
> average (which is actually an integration, using the rectangle method,
> and then dividing by the perf_read delta time). At the very least,
> rename the event to make the avg explicit.

OK, we can return instantaneous freq for now and if that doesn't work out,
add an average freq later?

>
> Lucas De Marchi

Thanks.
--
Ashutosh


More information about the Intel-xe mailing list