[PATCH v3] drm/xe/pmu: Add GT frequency events
Lucas De Marchi
lucas.demarchi at intel.com
Mon Mar 24 19:07:32 UTC 2025
On Mon, Mar 24, 2025 at 09:24:47AM -0700, Ashutosh Dixit wrote:
>On Sat, 22 Mar 2025 19:47:04 -0700, Lucas De Marchi wrote:
>>
>> On Sat, Mar 22, 2025 at 02:06:09PM -0700, Ashutosh Dixit wrote:
>> > On Sat, 22 Mar 2025 05:57:12 -0700, Lucas De Marchi wrote:
>> >>
>> >> On Fri, Mar 21, 2025 at 03:45:21PM -0700, Belgaumkar, Vinay wrote:
>> >> >
>> >> > On 3/13/2025 2:45 PM, Lucas De Marchi wrote:
>> >> >> On Tue, Mar 11, 2025 at 05:14:08PM -0700, Vinay Belgaumkar wrote:
>> >> >>> Define PMU events for GT frequency (actual and requested). This is
>> >> >>> a port from the i915 driver implementation, where an internal timer
>> >> >>> is used to aggregate GT frequencies over certain fixed interval.
>> >> >>> Following PMU events are being added-
>> >> >> ^
>> >> >> why do you use "-" instead of ":"?
>> >> >>
>> >> >>>
>> >> >>> xe_0000_00_02.0/gt-actual-frequency/ [Kernel PMU event]
>> >> >>> xe_0000_00_02.0/gt-requested-frequency/ [Kernel PMU event]
>> >> >>>
>> >> >>> Standard perf commands can be used to monitor GT frequency-
>> >> >>> $ perf stat -e xe_0000_00_02.0/gt-requested-frequency,gt=0/ -I1000
>> >> >>>
>> >> >>> 1.001175175 700 M xe/gt-requested-frequency,gt=0/
>> >> >>> 2.005891881 703 M xe/gt-requested-frequency,gt=0/
>> >> >>> 3.007318169 700 M xe/gt-requested-frequency,gt=0/
>> >> >>>
>> >> >>> Actual frequencies will be reported as 0 when GT is in C6.
>> >> >>
>> >> >>
>> >> >> I think we need to document somewhere, but at the very least in the
>> >> >> commit message what the event count actually is. Let me see if I get
>> >> >> this right: if userspace is sampling every 1sec, and assuming the gpu
>> >> >> is at 700MHz for the first 0.5sec and at 1.4 GHz, userspace should
>> >> >> expect to see ~1050 as the value. Correct? I find this frequency
>> >> >> handling very different from anything else reported via perf. Other
>> >> >> than i915, are there any other cases you know of?
>> >> >
>> >> > Yes, user will see a gradual ramp. One thing that could work is
>> >> > something along these lines:
>> >> >
>> >> > @@ -324,7 +327,10 @@ static void xe_pmu_event_update(struct perf_event
>> >> > *event)
>> >> > new = __xe_pmu_event_read(event);
>> >> > } while (!local64_try_cmpxchg(&hwc->prev_count, &prev, new));
>> >> >
>> >> > - local64_add(new - prev, &event->count);
>> >> > + if (is_gt_frequency_event(id))
>> >> > + local64_add(new, &event->count);
>> >> > + else
>> >> > + local64_add(new - prev, &event->count);
>> >> >
>> >> > This will give us instantaneous values and will not need the use of an
>> >> > internal timer. Should be ok to do it this way?
>> >>
>> >> yes, I think it'd be preferred. It's much simpler and don't prevent us
>> >> from eventually adding an avg_* event if the needs arise. Which would
>> >> also be clearer on what that is.
>> >
>> > Instantaneous values would show 0 if the sampling instant landed when gt is
>>
>> instantaneous values make much more sense from the perf point of view
>> IMO. perf is **sampling** them - other than the same thing in i915, I've
>> never seen an event that returns an average (not even documenting it as
>> such).
>>
>> What would also make sense would be to support perf-record so the
>> samples are recorded in the buffer that is passed to userspace. In this
>> case setting up a timer to save them (since we don't have interrupt for
>> our pmu) would be nice.
>
>Afaik, typically userspace calls into the kernel for PMU counters
>relatively infrequently, say once a second (1 Hz). However, if this low
>rate does not yield good freq values (say freq is showing lots of 0's), now
why do you say it's not getting good value? is 0 not a good value?
Should the kernel be returning something else instead of doing an
average with "lots of zeros" and returning it?
>userspace will need to call into the kernel more frequently, say 10 Hz to
>100 Hz, just to get good freq values (all other counters are fine at 1
It should be perfectly fine for an application to setup a 10Hz timer and
getting it multiple times if it wants to average.... I don't think it
needs to be at the hundreds frequency.
>Hz). This increased rate of calling into the kernel may be ok for 'perf
>record' but not sure what happens to say 'gputop'.
oh no, perf-record works diferently... it's not calling into the
kernel with that frequency. It's perf-stat that works that way.
For perf-record it more like kernel pushing samples in a ringbuffer, shared
with userspace, and then waking it. We (currently) don't support that
mode in our implementation:
static int xe_pmu_event_init(struct perf_event *event)
{
...
/* unsupported modes and filters */
if (event->attr.sample_period) /* no sampling */
return -EINVAL;
...
}
>
>I think considerations like this went into i915 deciding to average in the
>kernel.
as you noticed, maybe there was actually a lack of thought on the matter
and then later it couldn't be changed anymore... ? I saw several odd
things in the i915 side. Exposing something with a Hz unit is one of
them.
>
>Also, if we are going to expose instantaneous freq, if gt is in C6 (when
>userspace calls into the kernel to sample), could we save and return the
>freq /before/ gt went into C6 (to avoid returning 0 freq, and returning the
>last freq as it was when workload was running)? This might be sufficient
>and remove the need to sample more frequently, though not sure how
>complex to implement.
if that data would make more sense for userspace, then I think it should
be ok. So, let's say gt in in C6, what does make more sense for
gputop-like and turbostat?
1) read 0 from the kernel and report 0 to the user
2) read 0 from the kernel and report min_freq to the user (min_freq they
can read once from from sysfs)
3) read min_freq from the kernel
4) read the last frequency before going into C6
Lucas De Marchi
>
>>
>> > in C6. Also instantaneous values are already available via sysfs.
>>
>> with the caveat of a slower interface, but yeah... that's why I was
>> asking earlier: do we really need this if it's already available via
>> sysfs? I was told turbostat wants to consolidate all the data to be read
>> via perf instead of reading a bunch of files from sysfs.
>
>OK.
>
>>
>> > Also, the way i915 calculates the average freq is itself
>> > controversial. i915 computes average freq when gt is awake and disregards
>> > those intervals when gt is in C6. There was this patch to suggest that
>> > average freq also take into account those intervals when gt is in C6:
>> >
>> > https://patchwork.freedesktop.org/series/109372/
>> >
>> > But it was not accepted since it would change uapi.
>>
>> one more reason not to fall in the same trap. Let userspace decide if it
>> wants the average or not. If it'd need to sample too frequently, then I
>> think we need to look at supporting perf-record rather than doing the
>> average (which is actually an integration, using the rectangle method,
>> and then dividing by the perf_read delta time). At the very least,
>> rename the event to make the avg explicit.
>
>OK, we can return instantaneous freq for now and if that doesn't work out,
>add an average freq later?
>
>>
>> Lucas De Marchi
>
>Thanks.
>--
>Ashutosh
More information about the Intel-xe
mailing list