[PATCH] i915/pmu: Fix zero delta busyness issue

Umesh Nerlige Ramappa umesh.nerlige.ramappa at intel.com
Tue Jan 21 23:58:03 UTC 2025


On Tue, Jan 21, 2025 at 06:10:34PM -0500, Rodrigo Vivi wrote:
>On Tue, Jan 21, 2025 at 02:25:57PM -0800, Umesh Nerlige Ramappa wrote:
>
>drm/i915/pmu as tag please...

will do

>
>> When running igt at gem_exec_balancer@individual for multiple iterations,
>> it is seen that the delta busyness returned by PMU is 0. The issue stems
>> from a combination of 2 implementation specific details:
>>
>> 1) gt_park is throttling __update_guc_busyness_stats() so that it does
>> not hog PCI bandwidth for some use cases. (Ref: 59bcdb564b3ba)
>>
>> 2) busyness implementation always returns monotonically increasing
>> counters. (Ref: cf907f6d29421)
>>
>> If an application queried an engine while it was active,
>> engine->stats.guc.running is set to true. Following that, if all PM
>> wakeref's are released, then gt is parked. At this time the throttling
>> of __update_guc_busyness_stats() may result in a missed update to the
>> running state of the engine (due to (1) above). This means subsequent
>> calls to guc_engine_busyness() will think that the engine is still
>> running and they will keep updating the cached counter (stats->total).
>> This results in an inflated cached counter.
>>
>> Later when the application runs a workload and queries for busyness, we
>> return the cached value since it is larger than the actual value (due to
>> (2) above)
>>
>> All subsequent queries will return the same large (inflated) value, so
>> the application sees a delta busyness of zero.
>>
>> In order to fix the issue,
>> - reset the running state of engines in busyness_park outside of the
>>   throttling code.
>> - when busyness is queried and PM wakeref is not active, do not
>>   calculate active engine time since we do not expect engines to be
>>   active without a wakeref.
>>
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13366
>> Fixes: cf907f6d2942 ("i915/guc: Ensure busyness counter increases motonically")
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> ---
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c  | 18 +++++++++++++++++-
>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 12f1ba7ca9c1..b7a831e1fc85 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -1372,7 +1372,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>  	}
>>
>>  	total = intel_gt_clock_interval_to_ns(gt, stats->total_gt_clks);
>> -	if (stats->running) {
>> +	if (wakeref && stats->running) {
>
>do you really need to check this wakeref if you are now setting running to
>false earlier?

Maybe not. I did try it without this wakeref and that passes too.

>
>And if we do, what about moving this entire block to inside the
>existing if (wakeref) ?!

Yes, looks like I could move it inside the existing wakeref check block, 
but I think I will drop this check since running is being set to false 
in the gt_park code path.

>
>>  		u64 clk = guc->timestamp.gt_stamp - stats->start_gt_clk;
>>
>>  		total += intel_gt_clock_interval_to_ns(gt, clk);
>> @@ -1469,6 +1469,19 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
>>  	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>
>> +static void __update_guc_busyness_running_state(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.guc.running = false;
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>>  static void __update_guc_busyness_stats(struct intel_guc *guc)
>>  {
>>  	struct intel_gt *gt = guc_to_gt(guc);
>> @@ -1619,6 +1632,9 @@ void intel_guc_busyness_park(struct intel_gt *gt)
>>  	if (!guc_submission_initialized(guc))
>>  		return;
>>
>> +	/* Assume no engines are running and set running state to false */
>> +	__update_guc_busyness_running_state(guc);
>> +
>
>Why not to move the entire __reset_guc_busyness_stats earlier then?

Not sure what you mean by __reset_guc_busyness_stats because that is 
only called when reset is in progress.

Do you mean __update_guc_busyness_stats()?

Only the running state needs to be updated for every gt_park. Updates to 
other stats can be throttled based on the jiffies code in 
intel_guc_busyness_park().

Thanks,
Umesh

>
>>  	/*
>>  	 * There is a race with suspend flow where the worker runs after suspend
>>  	 * and causes an unclaimed register access warning. Cancel the worker
>> --
>> 2.34.1
>>


More information about the Intel-gfx mailing list