[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

Umesh Nerlige Ramappa umesh.nerlige.ramappa@intel.com
Mon Oct 18 20:35:15 UTC 2021


On Mon, Oct 18, 2021 at 11:35:44AM -0700, Umesh Nerlige Ramappa wrote:
>On Mon, Oct 18, 2021 at 08:58:01AM +0100, Tvrtko Ursulin wrote:
>>
>>
>>On 16/10/2021 00:47, Umesh Nerlige Ramappa wrote:
>>>With GuC handling scheduling, i915 is not aware of when a context is
>>>scheduled in and out of the engine. Since the i915 pmu relies on this
>>>info to provide engine busyness to the user, GuC shares this info with
>>>i915 for all engines via shared memory. For each engine, this info
>>>contains:
>>>
>>>- total busyness: total time that the context was running (total)
>>>- id: id of the running context (id)
>>>- start timestamp: timestamp when the context started running (start)
>>>
>>>At the time of sampling the engine busyness (now), if the id is valid
>>>(!= ~0) and start is non-zero, then the context is considered to be
>>>active and the engine busyness is calculated using the equation below:
>>>
>>>	engine busyness = total + (now - start)
>>>
>>>All times are obtained from the gt clock base. For inactive contexts,
>>>engine busyness is just equal to the total.
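>>>
>>>A minimal sketch of the calculation above (struct and function names are
>>>illustrative only, not the actual GuC interface or i915 code):
>>>
>>>	#include <linux/types.h>
>>>
>>>	struct engine_usage {	/* per-engine info shared by GuC */
>>>		u32 total;	/* accumulated busy time, in gt clock ticks */
>>>		u32 id;		/* id of the running context, ~0 if none */
>>>		u32 start;	/* gt timestamp when that context started */
>>>	};
>>>
>>>	static u32 engine_busyness_ticks(const struct engine_usage *u, u32 now)
>>>	{
>>>		u32 busy = u->total;
>>>
>>>		/* A context is active: add the in-progress interval. */
>>>		if (u->id != ~0u && u->start)
>>>			busy += now - u->start;
>>>
>>>		return busy;	/* extending to 64 bits is handled below */
>>>	}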
>>>
>>>The start and total values provided by GuC are 32 bits and wrap around
>>>in a few minutes. Since the perf pmu exposes busyness as a 64 bit
>>>monotonically increasing value, this implementation needs to account
>>>for the overflows and extend the time to 64 bits before returning
>>>busyness to the user. To do that, a worker runs periodically with a
>>>period equal to 1/8th of the time it takes for the timestamp to wrap.
>>>As an example, that is roughly once every 28 seconds for a gt clock
>>>frequency of 19.2 MHz.
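>>>
>>>A sketch of the wrap handling, with assumed function names rather than
>>>the actual i915 helpers:
>>>
>>>	#include <linux/kernel.h>
>>>	#include <linux/math64.h>
>>>	#include <linux/time64.h>
>>>	#include <linux/types.h>
>>>
>>>	/* Worker period: 1/8th of the time a 32 bit gt timestamp takes to wrap. */
>>>	static u64 busyness_ping_delay_ns(u32 gt_clock_hz)
>>>	{
>>>		u64 wrap_ns = div_u64(mul_u32_u32(NSEC_PER_SEC, U32_MAX), gt_clock_hz);
>>>
>>>		return wrap_ns >> 3;	/* ~27.9s at 19.2 MHz */
>>>	}
>>>
>>>	/*
>>>	 * Extend a 32 bit gt timestamp to 64 bits. Since the worker samples at
>>>	 * least once per 1/8th of the wrap period, at most one wrap can have
>>>	 * occurred since the previously recorded 64 bit value.
>>>	 */
>>>	static u64 extend_gt_stamp(u64 prev, u32 now)
>>>	{
>>>		u64 hi = prev & ~(u64)U32_MAX;
>>>
>>>		if (now < lower_32_bits(prev))	/* the 32 bit counter wrapped */
>>>			hi += (u64)U32_MAX + 1;
>>>
>>>		return hi | now;
>>>	}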
>>>
>>>Note:
>>>There might be over-accounting of busyness because GuC may be updating
>>>the total and start values while the kmd is reading them (i.e. the kmd
>>>may read the updated total together with the stale start). In such a
>>>case, the user may see a higher busyness value followed by smaller ones
>>>which eventually catch up to the higher value.
>>>
>>>v2: (Tvrtko)
>>>- Include details in commit message
>>>- Move intel engine busyness function into execlist code
>>>- Use union inside engine->stats
>>>- Use natural type for ping delay jiffies
>>>- Drop active_work condition checks
>>>- Use for_each_engine if iterating all engines
>>>- Drop seq locking, use spinlock at guc level to update engine stats
>>>- Document worker specific details
>>>
>>>v3: (Tvrtko/Umesh)
>>>- Demarcate guc and execlist stat objects with comments
>>>- Document known over-accounting issue in commit
>>>- Provide a consistent view of guc state
>>>- Add hooks to gt park/unpark for guc busyness
>>>- Stop/start worker in gt park/unpark path
>>>- Drop inline
>>>- Move spinlock and worker inits to guc initialization
>>>- Drop helpers that are called only once
>>>
>>>v4: (Tvrtko/Matt/Umesh)
>>>- Drop addressed opens from commit message
>>>- Get runtime pm in ping, remove from the park path
>>>- Use cancel_delayed_work_sync in disable_submission path
>>>- Update stats during reset prepare
>>>- Skip ping if reset in progress
>>>- Explicitly name execlists and guc stats objects
>>>- Since disable_submission is called from many places, move resetting
>>>  stats to intel_guc_submission_reset_prepare
>>>
>>>v5: (Tvrtko)
>>>- Add a trylock helper that does not sleep and synchronize PMU event
>>>  callbacks and worker with gt reset
>>>
>>>v6: (CI BAT failures)
>>>- DUTs using execlist submission failed to boot since __gt_unpark is
>>>  called during i915 load. This ends up calling the guc busyness unpark
>>>  hook and results in kickstarting an uninitialized worker. Let
>>>  park/unpark hooks check if guc submission has been initialized.
>>>- Drop cant_sleep() from the trylock helper since rcu_read_lock takes care
>>>  of that.
>>>
>>>v7: (CI) Fix igt@i915_selftest@live@gt_engines
>>>- For the guc mode of submission the engine busyness is derived from the
>>>  gt time domain. Use the elapsed gt time as the reference in the
>>>  selftest.
>>>- Increase the busyness measurement duration to 10ms to ensure the batch
>>>  runs longer and falls within the busyness tolerances in the selftest.
>>
>>[snip]
>>
>>>diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>index 75569666105d..24358bef6691 100644
>>>--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>@@ -234,6 +234,7 @@ static int live_engine_busy_stats(void *arg)
>>> 		struct i915_request *rq;
>>> 		ktime_t de, dt;
>>> 		ktime_t t[2];
>>>+		u32 gt_stamp;
>>> 
>>> 		if (!intel_engine_supports_stats(engine))
>>> 			continue;
>>>@@ -251,10 +252,16 @@ static int live_engine_busy_stats(void *arg)
>>> 		ENGINE_TRACE(engine, "measuring idle time\n");
>>> 		preempt_disable();
>>> 		de = intel_engine_get_busy_time(engine, &t[0]);
>>>-		udelay(100);
>>>+		gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>>>+		udelay(10000);
>>> 		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>>>+		gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP) - gt_stamp;
>>> 		preempt_enable();
>>>-		dt = ktime_sub(t[1], t[0]);
>>>+
>>>+		dt = intel_engine_uses_guc(engine) ?
>>>+		     intel_gt_clock_interval_to_ns(engine->gt, gt_stamp) :
>>>+		     ktime_sub(t[1], t[0]);
>>
>>But this then shows the thing might not work for external callers 
>>like PMU who have no idea about GUCPMTIMESTAMP and cannot obtain it 
>>anyway.
>>
>>What is the root cause of the failure here, the 100us period or the
>>clock source? Is the granularity of GUCPMTIMESTAMP perhaps simply too
>>coarse for a 100us test period? I forget what frequency it runs at.
>
>The guc timestamp ticks at 19.2 MHz on adlp/rkl (where I ran this).
>
>1)
>With 100us, oftentimes I see that the batch has not yet started, so I
>get busy time in the range of 0 - 60%. I increased the duration so that
>the batch runs long enough to keep the scheduling time < 5%.
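>
>(For reference, at 19.2 MHz one tick is ~52 ns, so even the 100us window
>spans ~1920 ticks; the limiting factor is not the timestamp resolution
>but whether the batch gets scheduled in within that window.)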
>
>2)
>I did 100 runs on rkl/adlp. No failures on rkl.

Sorry, my bad, RKL always failed with 91% busyness (I checked it again
just now). I think the first time I ran this, GuC was not enabled by default.

Regards,
Umesh

>On adlp, I saw one in 25 runs show 93%/94% busyness for rcs0 and fail
>(95% is expected). For that case I tried using the guc timestamp,
>thinking it would provide more accuracy. It did in my testing, but CI
>still failed for rkl-guc (110% busyness!), so now I think we just need
>to tweak the expected busyness for guc.
>
>Is 1) acceptable?
>
>For 2), I am thinking of just changing the expected busyness to 90% or
>more for guc mode, OR should we just let it fail occasionally?
>Thoughts?
>
>Thanks,
>Umesh
>
>>
>>Regards,
>>
>>Tvrtko

