[Intel-gfx] [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Feb 22 10:39:43 UTC 2022


On 22/02/2022 09:52, Tvrtko Ursulin wrote:
> 
> On 18/02/2022 21:33, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> GuC converts the pre-emption timeout and timeslice quantum values into
>> clock ticks internally. That significantly lowers the point of 32-bit
>> overflow. On current platforms, the worst case scenario is approximately
> 
> Where does 32-bit come from, the GuC side? We already use 64 bits, so 
> that's something to fix to start with. Yep...
> 
> ./gt/uc/intel_guc_fwif.h:       u32 execution_quantum;
> 
> ./gt/uc/intel_guc_submission.c: desc->execution_quantum = 
> engine->props.timeslice_duration_ms * 1000;
> 
> ./gt/intel_engine_types.h:              unsigned long 
> timeslice_duration_ms;
> 
> timeslice_store/preempt_timeout_store:
> err = kstrtoull(buf, 0, &duration);
> 
> So both kconfig and sysfs can already overflow the GuC, not only because 
> of the tick conversion internally but because at the backend level nothing 
> was done about assigning a 64-bit value into a 32-bit field. Or I failed 
> to find where it is handled.
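For illustration, a minimal user-space sketch of the truncation being described: the millisecond property is 64-bit, the GuC descriptor field is a `u32`, and the backend multiplies by 1000 with no range check. The function name here is hypothetical; the quoted i915 assignment is reproduced in the comment.

```c
#include <stdint.h>

/* Mirrors the quoted backend assignment:
 *   desc->execution_quantum = engine->props.timeslice_duration_ms * 1000;
 * A 64-bit millisecond value is converted to microseconds and then
 * silently truncated to 32 bits by the implicit u64 -> u32 conversion. */
static uint32_t to_execution_quantum(uint64_t timeslice_ms)
{
	return (uint32_t)(timeslice_ms * 1000);
}
/* e.g. 10,000,000 ms (10,000 s) -> 10^10 us, which needs 34 bits;
 * the stored u32 wraps to 1410065408 us (~1410 s). */
```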
> 
>> 110 seconds. Rather than allowing the user to set higher values and
>> then get confused by early timeouts, add limits when setting these
>> values.
> 
> Btw who is reviewing GuC patches these days - things have somehow gotten 
> pretty quiet and I don't think that's due to an absence of stuff to 
> improve or fix? Asking since I think I noticed a few already which you 
> posted and then crickets on the mailing list.
> 
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 15 +++++++++++++++
>>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 14 ++++++++++++++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++++++++
>>   3 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index e53008b4dd05..2a1e9f36e6f5 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -389,6 +389,21 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>>       if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>>           engine->props.preempt_timeout_ms = 0;
>> +    /* Cap timeouts to prevent overflow inside GuC */
>> +    if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
>> +        if (engine->props.timeslice_duration_ms > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
> 
> Hm "wanted".. There's been too much back and forth on the GuC load 
> options over the years to keep track.. intel_engine_uses_guc sounds 
> like it would work and read nicer.
> 
> And limit to class instead of applying to all engines looks like a miss.

Sorry, the limit-to-class comment does not apply here; I confused this 
with the last patch.

Regards,

Tvrtko

> 
>> +            drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %d to prevent possible overflow\n",
>> +                 GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>> +            engine->props.timeslice_duration_ms = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
> 
> I am not sure logging such a message during driver load is useful. Sounds 
> more like a confused driver which starts with one value and then 
> overrides itself. I'd just silently set the value appropriate for the 
> active backend. The preemption timeout kconfig text already documents the 
> fact that timeouts can get overridden at runtime depending on platform and 
> engine. So maybe just add the same text to the timeslice kconfig.
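A sketch of the kconfig wording this suggests. The added sentence below is illustrative, not quoted from the actual i915 Kconfig.profile; the elided prompt and help text are marked with "...":

```
config DRM_I915_TIMESLICE_DURATION
	int "..."
	default 1
	help
	  ...
	  Note that the effective value may be further limited at runtime
	  depending on the platform, engine and submission backend in use.
```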
> 
>> +        }
>> +
>> +        if (engine->props.preempt_timeout_ms > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +            drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %d to prevent possible overflow\n",
>> +                 GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>> +            engine->props.preempt_timeout_ms = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +        }
>> +    }
>> +
>>       engine->defaults = engine->props; /* never to change again */
>>       engine->context_size = intel_engine_context_size(gt, engine->class);
>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> index 967031056202..f57efe026474 100644
>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> @@ -221,6 +221,13 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
>>       if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>           return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        duration > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
>> +        duration = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
>> +        drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %lld to prevent possible overflow\n",
>> +             duration);
>> +    }
> 
> I would suggest avoiding the duplicated clamping logic. Maybe hide all 
> the backend logic in helpers then, like maybe:
> 
>    d = intel_engine_validate_timeslice/preempt_timeout(engine, duration);
>    if (d != duration)
>      return -EINVAL;
> 
> Returning -EINVAL would be equivalent to existing behaviour:
> 
>      if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>          return -EINVAL;
> 
> That way userspace has explicit notification and read-back is identical 
> to the written-in value. From engine setup you can just call the helper 
> silently.
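A minimal sketch of the helper shape suggested above; the names and the limit are illustrative, not the real i915 API. The backend-specific cap lives in one place: engine setup can apply it silently, while the sysfs store path compares and rejects.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define GUC_POLICY_MAX_EXEC_QUANTUM_MS (100 * 1000)

/* Returns the requested duration clamped to what the active backend
 * can represent; callers decide whether to accept or reject a change. */
static uint64_t intel_engine_validate_timeslice(bool uses_guc, uint64_t ms)
{
	if (uses_guc && ms > GUC_POLICY_MAX_EXEC_QUANTUM_MS)
		return GUC_POLICY_MAX_EXEC_QUANTUM_MS;
	return ms;
}

/* sysfs store path: reject rather than silently clamp, so read-back
 * always matches what userspace wrote. */
static int timeslice_store_sketch(bool uses_guc, uint64_t ms)
{
	if (intel_engine_validate_timeslice(uses_guc, ms) != ms)
		return -EINVAL;
	/* WRITE_ONCE(engine->props.timeslice_duration_ms, ms); */
	return 0;
}
```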
> 
>> +
>>       WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>>       if (execlists_active(&engine->execlists))
>> @@ -325,6 +332,13 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
>>       if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>           return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        timeout > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +        timeout = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +        drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %lld to prevent possible overflow\n",
>> +             timeout);
>> +    }
>> +
>>       WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>>       if (READ_ONCE(engine->execlists.pending[0]))
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index 6a4612a852e2..ad131092f8df 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>>   #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>> +/*
>> + * GuC converts the timeout to clock ticks internally. Different platforms have
>> + * different GuC clocks. Thus, the maximum value before overflow is platform
>> + * dependent. Current worst case scenario is about 110s. So, limit to 100s to be
>> + * safe.
>> + */
>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS        (100 * 1000)
>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS    (100 * 1000)
> 
> Most important question -
> how will we know/notice if/when a new GuC arrives where these timeouts 
> would still overflow? Can this be queried somehow at runtime, or where 
> does the limit come from? How is the GuC told about it? Is it set in 
> some field which just allows too-large values to silently break things?
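For reference, the quoted "about 110s" worst case is consistent with a 32-bit tick counter driven by a clock in the tens of MHz. The 38.4 MHz and 19.2 MHz figures below are assumed reference-clock frequencies used for illustration, not documented GuC values:

```c
#include <stdint.h>

/* Maximum timeout representable in a 32-bit tick counter:
 * 2^32 ticks / clock_hz seconds, returned here in milliseconds. */
static uint64_t max_timeout_ms(uint64_t clock_hz)
{
	return (((uint64_t)1 << 32) * 1000) / clock_hz;
}
/* 38.4 MHz -> 111848 ms (~111.8 s), matching "about 110s";
 * 19.2 MHz -> 223696 ms (~223.7 s). */
```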
> 
> Regards,
> 
> Tvrtko
> 
>> +
>>   struct guc_policies {
>>       u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>       /* In micro seconds. How much time to allow before DPC processing is

