[PATCH] drm/i915/pmu: Fix zero delta busyness issue

Umesh Nerlige Ramappa umesh.nerlige.ramappa at intel.com
Fri Jan 24 18:35:47 UTC 2025


On Fri, Jan 24, 2025 at 01:06:08PM -0500, Rodrigo Vivi wrote:
>On Thu, Jan 23, 2025 at 11:38:39AM -0800, Umesh Nerlige Ramappa wrote:
>> When running igt at gem_exec_balancer@individual for multiple iterations,
>> it is seen that the delta busyness returned by PMU is 0. The issue stems
>> from a combination of 2 implementation specific details:
>>
>> 1) gt_park is throttling __update_guc_busyness_stats() so that it does
>> not hog PCI bandwidth for some use cases. (Ref: 59bcdb564b3ba)
>>
>> 2) busyness implementation always returns monotonically increasing
>> counters. (Ref: cf907f6d29421)
>>
>> If an application queried an engine while it was active,
>> engine->stats.guc.running is set to true. Following that, if all PM
>> wakeref's are released, then gt is parked. At this time the throttling
>> of __update_guc_busyness_stats() may result in a missed update to the
>> running state of the engine (due to (1) above). This means subsequent
>> calls to guc_engine_busyness() will think that the engine is still
>> running and they will keep updating the cached counter (stats->total).
>> This results in an inflated cached counter.
>>
>> Later when the application runs a workload and queries for busyness, we
>> return the cached value since it is larger than the actual value (due to
>> (2) above)
>>
>> All subsequent queries will return the same large (inflated) value, so
>> the application sees a delta busyness of zero.
>>
>> Fix the issue by resetting the running state of engines each time
>> intel_guc_busyness_park() is called.
>>
>> v2: (Rodrigo)
>> - Use the correct tag in commit message
>> - Drop the redundant wakeref check in guc_engine_busyness() and update
>>   commit message
>
>Thank you
>
>>
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13366
>> Fixes: cf907f6d2942 ("i915/guc: Ensure busyness counter increases motonically")
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> ---
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 3b1333a24a89..a33b67b83dc1 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -1469,6 +1469,19 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
>>  	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>
>> +static void __update_guc_busyness_running_state(struct intel_guc *guc)
>
>I hate those '__', but I forgot to spot in the previous email and
>I know I know... there's a lot of those in this file already :/

Agree, I added it this time to keep the names consistent with existing 
mess (which was also created by me!). Otherwise, as per other reviewers' 
comments, I don't use them.

Some tips on when to use '__' would help.

>
>I was thinking of another name for this function as well since
>it is only stopping the running state, but I'm bad with naming...
>
>let's move on and close the real issue
>
>Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>

Thanks,
Umesh
>
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.guc.running = false;
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>>  static void __update_guc_busyness_stats(struct intel_guc *guc)
>>  {
>>  	struct intel_gt *gt = guc_to_gt(guc);
>> @@ -1619,6 +1632,9 @@ void intel_guc_busyness_park(struct intel_gt *gt)
>>  	if (!guc_submission_initialized(guc))
>>  		return;
>>
>> +	/* Assume no engines are running and set running state to false */
>> +	__update_guc_busyness_running_state(guc);
>> +
>>  	/*
>>  	 * There is a race with suspend flow where the worker runs after suspend
>>  	 * and causes an unclaimed register access warning. Cancel the worker
>> --
>> 2.38.1
>>


More information about the Intel-gfx mailing list