[Intel-gfx] [PATCH 6/7] drm/i915/pmu: Add running counter

Wed Jun 6 15:52:17 UTC 2018

On 06/06/2018 16:23, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-06 15:40:10)
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> We add a PMU counter to expose the number of requests currently executing
>> on the GPU.
>>
>> This is useful to analyze the overall load of the system.
>>
>> v2:
>>   * Rebase.
>>   * Drop floating point constant. (Chris Wilson)
>>
>> v3:
>>   * Change scale to 1024 for faster arithmetic. (Chris Wilson)
>>
>> v4:
>>   * Refactored for timer period accounting.
>>
>> v5:
>>   * Avoid 64-division. (Chris Wilson)
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> ---
>>   #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
>>   
>> @@ -226,6 +227,13 @@ engines_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
>>                                          div_u64((u64)period_ns *
>>                                                  I915_SAMPLE_QUEUED_DIVISOR,
>>                                                  1000000));
>> +
>> +               if (engine->pmu.enable & BIT(I915_SAMPLE_RUNNING))
>> +                       add_sample_mult(&engine->pmu.sample[I915_SAMPLE_RUNNING],
>> +                                       last_seqno - current_seqno,
>> +                                       div_u64((u64)period_ns *
>> +                                               I915_SAMPLE_QUEUED_DIVISOR,
>> +                                               1000000));
> 
> Are we worried about losing precision with qd.ns?
> 
> add_sample_mult(SAMPLE, x, period_ns); here
> 
>> @@ -560,7 +569,8 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>>                          val = engine->pmu.sample[sample].cur;
>>   
>>                          if (sample == I915_SAMPLE_QUEUED ||
>> -                           sample == I915_SAMPLE_RUNNABLE)
>> +                           sample == I915_SAMPLE_RUNNABLE ||
>> +                           sample == I915_SAMPLE_RUNNING)
>>                                  val = div_u64(val, MSEC_PER_SEC);  /* to qd */
> 
> and val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR, NSEC_PER_SEC);

Yeah that works, thanks.
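
For reference, a minimal userspace sketch of the two schemes being compared: scaling each sample before accumulating (as in the patch) versus accumulating raw qd.ns and scaling once at read time (the suggestion above). It assumes I915_SAMPLE_QUEUED_DIVISOR is 1024, as the v3 changelog implies; the function names are hypothetical, and plain `/` stands in for the kernel's div_u64().

```c
#include <stdint.h>

#define SAMPLE_DIVISOR 1024u        /* assumed value of I915_SAMPLE_QUEUED_DIVISOR */
#define MSEC_PER_SEC   1000u
#define NSEC_PER_SEC   1000000000u

/* Patch as posted: scale the period on every sample, then accumulate. */
static uint64_t accumulate_scaled(uint64_t cur, uint32_t x, uint32_t period_ns)
{
    /* This per-sample division truncates any sub-microsecond remainder. */
    uint32_t mult = (uint32_t)(((uint64_t)period_ns * SAMPLE_DIVISOR) / 1000000u);

    return cur + (uint64_t)x * mult;
}

static uint64_t read_scaled(uint64_t cur)
{
    return cur / MSEC_PER_SEC;
}

/* Suggested alternative: accumulate raw qd.ns, no division per sample. */
static uint64_t accumulate_ns(uint64_t cur, uint32_t x, uint32_t period_ns)
{
    return cur + (uint64_t)x * period_ns;
}

/* A single division, performed only when the counter is read. */
static uint64_t read_ns(uint64_t cur)
{
    return cur * SAMPLE_DIVISOR / NSEC_PER_SEC;
}
```

With a period that is a whole number of microseconds the two variants agree; with an odd period the per-sample truncation in the first variant accumulates, while the second loses at most one unit at read time.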

> So that gives us a limit of ~1 million qd (assuming the user cares about
> intervals of roughly 1s). Up to 8 million, without loss of generality, with
> 
> 	val = div_u64(val * I915_SAMPLE_QUEUED_DIVISOR/8, NSEC_PER_SEC/8);

Or keep it in qd.us, as is done for frequency. I think the precision is plenty in any case.
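
To illustrate the /8 trick quoted above, here is a small userspace sketch. It assumes I915_SAMPLE_QUEUED_DIVISOR is 1024 (per the v3 changelog); the helper names are hypothetical and plain `/` stands in for div_u64(). Since both 1024 and NSEC_PER_SEC are divisible by 8, the quotient is unchanged, but the intermediate product is eight times smaller, so the accumulated value can grow eight times larger before the multiplication wraps.

```c
#include <stdint.h>

#define SAMPLE_DIVISOR 1024u        /* assumed value of I915_SAMPLE_QUEUED_DIVISOR */
#define NSEC_PER_SEC   1000000000u

/* Straightforward conversion from accumulated qd.ns to scaled qd. */
static uint64_t to_qd(uint64_t val_ns)
{
    return val_ns * SAMPLE_DIVISOR / NSEC_PER_SEC;
}

/* Same ratio with both factors pre-divided by 8 (128 / 125000000). */
static uint64_t to_qd_wide(uint64_t val_ns)
{
    return val_ns * (SAMPLE_DIVISOR / 8) / (NSEC_PER_SEC / 8);
}

/* Largest input whose product with mult still fits in 64 bits. */
static uint64_t max_input(uint32_t mult)
{
    return UINT64_MAX / mult;
}
```

For example, an input of 2^54 wraps to zero in to_qd() but still converts correctly in to_qd_wide().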

> Anyway, just concerned to have more than one 64b division and want to
> provoke you into thinking of a way of avoiding it :)

It is an optimized 64-bit divide, or 64-divide as I faltered in the 
commit message :), so not as bad as 64/64, but still your idea is very good.

Regards,

Tvrtko