[PATCH v3] drm/xe/pmu: Add GT frequency events

Tue Mar 25 02:17:11 UTC 2025

On 3/24/2025 6:10 PM, Dixit, Ashutosh wrote:
> On Mon, 24 Mar 2025 15:14:55 -0700, Belgaumkar, Vinay wrote:
>> On Mon, Mar 24, 2025 at 02:07:32PM -0500, Lucas De Marchi wrote:
>>> On Mon, Mar 24, 2025 at 09:24:47AM -0700, Ashutosh Dixit wrote:
>>>> On Sat, 22 Mar 2025 19:47:04 -0700, Lucas De Marchi wrote:
>>>>> On Sat, Mar 22, 2025 at 02:06:09PM -0700, Ashutosh Dixit wrote:
>>>>>> On Sat, 22 Mar 2025 05:57:12 -0700, Lucas De Marchi wrote:
>>>>>>> On Fri, Mar 21, 2025 at 03:45:21PM -0700, Belgaumkar, Vinay wrote:
>>>>>>>> On 3/13/2025 2:45 PM, Lucas De Marchi wrote:
>>>>>>>>> On Tue, Mar 11, 2025 at 05:14:08PM -0700, Vinay Belgaumkar wrote:
>>>>>>>>>> Define PMU events for GT frequency (actual and requested).
>>>>>>>>>> This is a port from the i915 driver implementation, where
>>>>>>>>>> an internal timer is used to aggregate GT frequencies over certain fixed interval.
>>>>>>>>>> Following PMU events are being added-
>>>>>>>>>                       ^
>>>>>>>>> why do you use "-"  instead of ":"?
>>>>>>>>>
>>>>>>>>>>   xe_0000_00_02.0/gt-actual-frequency/              [Kernel
>>>>>>>>>> PMU event]
>>>>>>>>>>   xe_0000_00_02.0/gt-requested-frequency/           [Kernel
>>>>>>>>>> PMU event]
>>>>>>>>>>
>>>>>>>>>> Standard perf commands can be used to monitor GT frequency-
>>>>>>>>>>   $ perf stat -e
>>>>>>>>>> xe_0000_00_02.0/gt-requested-frequency,gt=0/ -I1000
>>>>>>>>>>
>>>>>>>>>>      1.001175175                700 M
>>>>>>>>>> xe/gt-requested-frequency,gt=0/
>>>>>>>>>>      2.005891881                703 M
>>>>>>>>>> xe/gt-requested-frequency,gt=0/
>>>>>>>>>>      3.007318169                700 M
>>>>>>>>>> xe/gt-requested-frequency,gt=0/
>>>>>>>>>>
>>>>>>>>>> Actual frequencies will be reported as 0 when GT is in C6.
>>>>>>>>>
>>>>>>>>> I think we need to document somewhere, but at the very least
>>>>>>>>> in the commit message what the event count actually is. Let
>>>>>>>>> me see if I get this right:  if userspace is sampling every
>>>>>>>>> 1sec, and assuming the gpu is at 700MHz for the first 0.5sec
>>>>>>>>> and at 1.4 GHz, userspace should expect to see ~1050 as the
>>>>>>>>> value. Correct?  I find this frequency handling very
>>>>>>>>> different from anything else reported via perf. Other than i915, are there any other cases you know of?
>>>>>>>> Yes,  user will see a gradual ramp. One thing that could work
>>>>>>>> is something along these lines:
>>>>>>>>
>>>>>>>> @@ -324,7 +327,10 @@ static void xe_pmu_event_update(struct
>>>>>>>> perf_event
>>>>>>>> *event)
>>>>>>>>                  new = __xe_pmu_event_read(event);
>>>>>>>>          } while (!local64_try_cmpxchg(&hwc->prev_count,
>>>>>>>> &prev, new));
>>>>>>>>
>>>>>>>> -       local64_add(new - prev, &event->count);
>>>>>>>> +       if (is_gt_frequency_event(id))
>>>>>>>> +               local64_add(new, &event->count);
>>>>>>>> +       else
>>>>>>>> +               local64_add(new - prev, &event->count);
>>>>>>>>
>>>>>>>> This will give us instantaneous values and will not need the
>>>>>>>> use of an internal timer.  Should be ok to do it this way?
>>>>>>> yes, I think it'd be preferred. It's much simpler and don't
>>>>>>> prevent us from eventually adding an avg_* event if the needs
>>>>>>> arise. Which would also be clearer on what that is.
>>>>>> Instantaneous values would show 0 if the sampling instant landed
>>>>>> when gt is
>>>>> instantaneous values make much more sense from the perf point of
>>>>> view IMO. perf is **sampling** them - other than the same thing in
>>>>> i915, I've never seen an event that returns an average (not even
>>>>> documenting it as such).
>>>>>
>>>>> What would also make sense would be to support perf-record so the
>>>>> samples are recorded in the buffer that is passed to userspace. In
>>>>> this case setting up a timer to save them (since we don't have
>>>>> interrupt for our pmu) would be nice.
>>>> Afaik, typically userspace calls into the kernel for PMU counters
>>>> relatively infrequently, say once a second (1 Hz). However, if this
>>>> low rate does not yield good freq values (say freq is showing lots
>>>> of 0's), now
>>> why do you say it's not getting good value? is 0 not a good value?
>>> Should the kernel be returning something else instead of doing an
>>> average with "lots of zeros" and returning it?
>> well, zero is not a good value because it is not real. The register is
>> zero because the GT is sleeping, but that is not the accurate thing.
> Not necessarily a not-good value. For example, if the freq is 1 GHz 50% of
> the time and GT is sleeping 50% of the time, you could say the average freq
> is 500 MHz. But userspace would have to sample at a high enough rate to
> figure out GT is sleeping 50% of the time. Maybe 10 or 20 Hz is enough, see
> below.
>
> i915 disregards C6 so would report the average as 1 GHz. The 200 Hz
> sampling rate for averaging in i915 was designed based on host turbo I
> believe.
i915 does not disregard C6. If the GT is in C6 long enough, the average 
actual frequency will also show 0 there.
>
>> but tbh we are already taking this decision to our sysfs interface:
>>
>> $ cat act_freq
>> 0
>>
>> But well, for this instantaneous it may be good... we just need to
>> educate the consumers to see the zero as gpu sleeping in a chart.  but
>> zero is an outlier to the average of the frequencies that it will bring
>> the mean to a very lower levels and the charts collected with PMU will be
>> so out of shape to the point it might be useless...
>>
>> True.  Better to stick with instantaneous value in this case.
>>
>>>> userspace will need to call into the kernel more frequently, say 10
>>>> Hz to
>>>> 100 Hz, just to get good freq values (all other counters are fine at
>>>> 1
>>> It should be perfectly fine for an application to setup a 10Hz timer
>>> and getting it multiple times if it wants to average.... I don't think
>>> it needs to be at the hundreds frequency.
> This could depend on how fast SLPC changes the freq and maybe @Belgaumkar,
> Vinay might know what is the typical rate at which userspace would need to
> sample to get a good average.
>
>>>> Hz). This increased rate of calling into the kernel may be ok for
>>>> 'perf record' but not sure what happens to say 'gputop'.
>>> oh no, perf-record works diferently... it's not calling into the
>>> kernel with that frequency. It's perf-stat that works that way.
>>>
>>> For perf-record it more like kernel pushing samples in a ringbuffer,
>>> shared with userspace, and then waking it. We (currently) don't
>>> support that mode in our implementation:
>>>
>>> static int xe_pmu_event_init(struct perf_event *event) {
>>> 	...
>>> 	/* unsupported modes and filters */
>>> 	if (event->attr.sample_period) /* no sampling */
>>> 		return -EINVAL;
>>> 	...
>>> }
>>>
>>>
>>>> I think considerations like this went into i915 deciding to average
>>>> in the kernel.
>>> as you noticed, maybe there was actually a lack of thought on the
>>> matter and then later it couldn't be changed anymore... ? I saw
>>> several odd things in the i915 side. Exposing something with a Hz unit
>>> is one of them.
>>>
>>>> Also, if we are going to expose instantaneous freq, if gt is in C6
>>>> (when userspace calls into the kernel to sample), could we save and
>>>> return the freq /before/ gt went into C6 (to avoid returning 0 freq,
>>>> and returning the last freq as it was when workload was running)?
>>>> This might be sufficient and remove the need to sample more
>>>> frequently, though not sure how complex to implement.
>>> if that data would make more sense for userspace, then I think it
>>> should be ok. So, let's say gt in in C6, what does make more sense for
>>> gputop-like and turbostat?
>>>
>>> 1) read 0 from the kernel and report 0 to the user
>>> 2) read 0 from the kernel and report min_freq to the user (min_freq they
>>>     can read once from from sysfs)
>>> 3) read min_freq from the kernel
>>> 4) read the last frequency before going into C6
> This depends on what we want userland to do:
>
> * If we are asking userspace to read the freq's from PMU at a high rate and
>    then average those freq's, we should report 0
i915 does report 0 as well after a few idle iterations.
>
> * What I was saying was, is it possible to always return a "representative"
>    freq, so that userspace does not have to sample at a high rate and not
>    average the freq. If we can return this "representative" freq, without
>    averaging in the kernel.
Not sure what's the best way to calculate something like this. The algo 
would certainly need special handling when the GPU is various shades of 
busy.
>
>    Can this freq be the last requested/actual freq before gt went into C6?
>    If SLPC is changing the freq rapidly, this would not be accurate either,
>    but maybe it's good enough?
KMD will not know when exactly GT will enter C6.
>
>    And we'd have to tell userspace that it should *not* average the
>    retrieved freq, that would be inaccurate. If they really want an accurate
>    average they should use sysfs.
>
>> The GT requested frequency will already show us the last requested
>> value. Assuming no throttling, there is a good chance this is what the GT
>> was at. Besides, KMD does not control when GT goes to C6, GuC does the
>> idle indication (since we are using GuC RC).
> As mentioned on v4, if we are not waking the card up, the requested freq
> would itself be 0?
No, requested freq register will always remain at the last requested 
value. We are using a forcewake in the KMD function that reads the 
MMIO(blitter power well). This register retains it's value.
>
>>>>>> in C6. Also instantaneous values are already available via sysfs.
>>>>> with the caveat of a slower interface, but yeah... that's why I
>>>>> was asking earlier: do we really need this if it's already
>>>>> available via sysfs? I was told turbostat wants to consolidate all
>>>>> the data to be read via perf instead of reading a bunch of files from sysfs.
>>>> OK.
>>>>
>>>>>> Also, the way i915 calculates the average freq is itself
>>>>>> controversial. i915 computes average freq when gt is awake and
>>>>>> disregards those intervals when gt is in C6. There was this
>>>>>> patch to suggest that average freq also take into account those intervals when gt is in C6:
>>>>>>
>>>>>> https://patchwork.freedesktop.org/series/109372/
>>>>>>
>>>>>> But it was not accepted since it would change uapi.
>>>>> one more reason not to fall in the same trap. Let userspace decide
>>>>> if it wants the average or not. If it'd need to sample too
>>>>> frequently, then I think we need to look at supporting perf-record
>>>>> rather than doing the average (which is actually an integration,
>>>>> using the rectangle method, and then dividing by the perf_read
>>>>> delta time). At the very least, rename the event to make the avg explicit.
>>>> OK, we can return instantaneous freq for now and if that doesn't
>>>> work out, add an average freq later?

Makes sense to keep it instantaneous for now. Sent v4 with that change - 
https://patchwork.freedesktop.org/patch/644758/?series=144164&rev=6

Thanks,

Vinay.