[Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

Thu Oct 7 15:42:33 UTC 2021

On Thu, Oct 07, 2021 at 09:17:34AM +0100, Tvrtko Ursulin wrote:
>
>On 06/10/2021 21:45, Umesh Nerlige Ramappa wrote:
>>On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:
>
>[snip]
>
>>>>@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>>> static void disable_submission(struct intel_guc *guc)
>>>> {
>>>>     struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>>>+    struct intel_gt *gt = guc_to_gt(guc);
>>>>+    struct intel_engine_cs *engine;
>>>>+    enum intel_engine_id id;
>>>>+    unsigned long flags;
>>>>     if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>>>         GEM_BUG_ON(!guc->ct.enabled);
>>>>         __tasklet_disable_sync_once(&sched_engine->tasklet);
>>>>         sched_engine->tasklet.callback = NULL;
>>>>     }
>>>>+
>>>>+    cancel_delayed_work(&guc->timestamp.work);
>>>
>>>I am not sure when disable_submission gets called so a question - 
>>>could it be important to call cancel_delayed_work_sync here to 
>>>ensure if the worker was running it had exited before proceeding?
>>
>>disable_submission is called in the reset_prepare path for uc 
>>resets. I see this happening only with busy-hang test which does a 
>>global gt reset. The counterpart for this is the 
>>guc_init_engine_stats which is called post reset in the path to 
>>initialize GuC.
>>
>>I tried cancel_delayed_work_sync both here and in park. Seems to 
>>work fine, so will change the calls to _sync versions.
>
>From park is not allowed to sleep so can't do sync from there. It 
>might have been my question which put you on a wrong path, sorry. Now 
>I think question remains what happens if the ping worker happens to be 
>sampling GuC state as GuC is being reset? Do you need some sort of a 
>lock to protect that, or make sure worker skips if reset in progress?
>

If ping ran after the actual gt reset, we should be okay. If it ran 
after we reset prev_total and before gt reset, then we have bad 
busyness. At the same time, skipping ping risks timestamp overflow. I am 
thinking skip ping, but update all stats in the reset_prepare path.  
reset_prepare is running with pm runtime.

On a different note, during reset, we need to store now-start into the 
total_gt_clks also because we may lose that information in the next pmu 
query or ping (post reset). Maybe I will store active_clks instead of 
running in the stats to do that.

Thanks,
Umesh