[PATCH] drm/i915/pmu: Match frequencies reported by PMU and sysfs

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Oct 4 13:05:55 UTC 2022


On 04/10/2022 14:00, Tvrtko Ursulin wrote:
> 
> On 04/10/2022 10:29, Tvrtko Ursulin wrote:
>>
>> On 03/10/2022 20:24, Ashutosh Dixit wrote:
>>> PMU and sysfs use different wakeref's to "interpret" zero freq. Sysfs 
>>> uses
>>> runtime PM wakeref (see intel_rps_read_punit_req and
>>> intel_rps_read_actual_frequency). PMU uses the GT parked/unparked
>>> wakeref. In general the GT wakeref is held for less time that the 
>>> runtime
>>> PM wakeref which causes PMU to report a lower average freq than the 
>>> average
>>> freq obtained from sampling sysfs.
>>>
>>> To resolve this, use the same freq functions (and wakeref's) in PMU as
>>> those used in sysfs.
>>>
>>> Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7025
>>> Reported-by: Ashwin Kumar Kulkarni <ashwin.kumar.kulkarni at intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>>> Signed-off-by: Ashutosh Dixit <ashutosh.dixit at intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_pmu.c | 27 ++-------------------------
>>>   1 file changed, 2 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_pmu.c 
>>> b/drivers/gpu/drm/i915/i915_pmu.c
>>> index 958b37123bf1..eda03f264792 100644
>>> --- a/drivers/gpu/drm/i915/i915_pmu.c
>>> +++ b/drivers/gpu/drm/i915/i915_pmu.c
>>> @@ -371,37 +371,16 @@ static void
>>>   frequency_sample(struct intel_gt *gt, unsigned int period_ns)
>>>   {
>>>       struct drm_i915_private *i915 = gt->i915;
>>> -    struct intel_uncore *uncore = gt->uncore;
>>>       struct i915_pmu *pmu = &i915->pmu;
>>>       struct intel_rps *rps = &gt->rps;
>>>       if (!frequency_sampling_enabled(pmu))
>>>           return;
>>> -    /* Report 0/0 (actual/requested) frequency while parked. */
>>> -    if (!intel_gt_pm_get_if_awake(gt))
>>> -        return;
>>> -
>>>       if (pmu->enable & config_mask(I915_PMU_ACTUAL_FREQUENCY)) {
>>> -        u32 val;
>>> -
>>> -        /*
>>> -         * We take a quick peek here without using forcewake
>>> -         * so that we don't perturb the system under observation
>>> -         * (forcewake => !rc6 => increased power use). We expect
>>> -         * that if the read fails because it is outside of the
>>> -         * mmio power well, then it will return 0 -- in which
>>> -         * case we assume the system is running at the intended
>>> -         * frequency. Fortunately, the read should rarely fail!
>>> -         */
>>> -        val = intel_uncore_read_fw(uncore, GEN6_RPSTAT1);
>>> -        if (val)
>>> -            val = intel_rps_get_cagf(rps, val);
>>> -        else
>>> -            val = rps->cur_freq;
>>> -
>>>           add_sample_mult(&pmu->sample[__I915_SAMPLE_FREQ_ACT],
>>> -                intel_gpu_freq(rps, val), period_ns / 1000);
>>> +                intel_rps_read_actual_frequency(rps),
>>> +                period_ns / 1000);
>>>       }
>>>       if (pmu->enable & config_mask(I915_PMU_REQUESTED_FREQUENCY)) {
>>
>> What is software tracking of requested frequency showing when GT is 
>> parked or runtime suspended? With this change sampling would be 
>> outside any such checks so we need to be sure reported value makes sense.
>>
>> Although more important open is around what is actually correct.
>>
>> For instance how does the patch affect RC6 and power? I don't know how 
>> power management of different blocks is wired up, so personally I 
>> would only be able to look at it empirically. In other words what I am 
>> asking is this - if we changed from skipping obtaining forcewake even 
>> when unparked, to obtaining forcewake if not runtime suspended - what 
>> hardware blocks does that power up and how it affects RC6 and power? 
>> Can it affect actual frequency or not? (Will "something" power up the 
>> clocks just because we will be getting forcewake?)
>>
>> Or maybe question simplified - does 200Hz polling on existing sysfs 
>> actual frequency field disturbs the system under some circumstances? 
>> (Increases power and decreases RC6.) If it does then that would be a 
>> problem. We want a solution which shows the real data, but where the 
>> act of monitoring itself does not change it too much. If it doesn't 
>> then it's okay.
>>
>> Could you somehow investigate on these topics? Maybe log RAPL GPU 
>> power while polling on sysfs, versus getting the actual frequency from 
>> the existing PMU implementation and see if that shows anything? Or 
>> actually simpler - RAPL GPU power for current PMU intel_gpu_top versus 
>> this patch? On idle(-ish) desktop workloads perhaps? Power and 
>> frequency graphed for both.
> 
> Another thought - considering that bspec says for 0xa01c "This register 
> reflects real-time values and thus does not have a pre-determined 
> default value out of reset" - could it be that it also does not reflect 
> a real value when GPU is not executing anything (so zero), just happens 
> to be not runtime suspended? That would mean sysfs reads could maybe 
> show last known value? Just a thought to check.
> 
> I've also tried on my Alderlake desktop:
> 
> 1)
> 
> while true; do cat gt_act_freq_mhz >/dev/null; sleep 0.005; done
> 
> This costs ~120mW of GPU power and ~20% decrease in RC6.
> 
> 
> 2)
> 
> intel_gpu_top -l -s 5 >/dev/null

This "-s 5" was pointless though. :)

Regards,

Tvrtko

> 
> This costs no power or RC6.
> 
> I have also never observed sysfs to show below min freq. This was with 
> no desktop so it's possible this register indeed does not reflect the 
> real situation when things are idle.
> 
> So I think it is possible sysfs value is the misleading one.
> 
> Regards,
> 
> Tvrtko


More information about the dri-devel mailing list