[Intel-gfx] [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Feb 22 10:39:43 UTC 2022
On 22/02/2022 09:52, Tvrtko Ursulin wrote:
>
> On 18/02/2022 21:33, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> GuC converts the pre-emption timeout and timeslice quantum values into
>> clock ticks internally. That significantly reduces the point of 32bit
>> overflow. On current platforms, worst case scenario is approximately
>
> Where does 32-bit come from, the GuC side? We already use 64 bits, so
> that's something to fix to start with. Yep...
>
> ./gt/uc/intel_guc_fwif.h:       u32 execution_quantum;
>
> ./gt/uc/intel_guc_submission.c: desc->execution_quantum = engine->props.timeslice_duration_ms * 1000;
>
> ./gt/intel_engine_types.h:      unsigned long timeslice_duration_ms;
>
> timeslice_store/preempt_timeout_store:
>         err = kstrtoull(buf, 0, &duration);
> So both kconfig and sysfs can already overflow the GuC interface, not
> only because of the internal tick conversion but because at the backend
> level nothing handles assigning a 64-bit value into a 32-bit field. Or
> I failed to find where it is handled.
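>
> For illustration, a minimal sketch of where the value gets truncated
> (not the actual i915 code paths, just the shape of the problem):
>
>         u64 duration_ms;        /* sysfs value, from kstrtoull() */
>         u32 execution_quantum;  /* GuC interface field */
>
>         /* The multiply happens in 64 bits, but the assignment silently
>          * truncates to 32; anything above ~4294967 ms (2^32 us) wraps
>          * before GuC even converts to ticks. */
>         execution_quantum = duration_ms * 1000;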
>
>> 110 seconds. Rather than allowing the user to set higher values and
>> then get confused by early timeouts, add limits when setting these
>> values.
>
> Btw who is reviewing GuC patches these days - things have somehow gotten
> pretty quiet and I don't think that's due to an absence of stuff to
> improve or fix? Asking since I think I noticed a few which you posted
> and then crickets on the mailing list.
>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 15 +++++++++++++++
>>  drivers/gpu/drm/i915/gt/sysfs_engines.c     | 14 ++++++++++++++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index e53008b4dd05..2a1e9f36e6f5 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -389,6 +389,21 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>>      if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>>              engine->props.preempt_timeout_ms = 0;
>> +    /* Cap timeouts to prevent overflow inside GuC */
>> +    if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
>> +            if (engine->props.timeslice_duration_ms > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
>
> Hm "wanted".. There's been too much back and forth on the GuC load
> options over the years to keep track.. intel_engine_uses_guc work sounds
> like would work and read nicer.
>
> And limit to class instead of applying to all engines looks like a miss.
Sorry, limiting to class does not apply here; I confused this with the
last patch.
Regards,
Tvrtko
>
>> +                    drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %d to prevent possible overflow\n",
>> +                             GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>> +                    engine->props.timeslice_duration_ms = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
>
> I am not sure logging such a message during driver load is useful. It
> sounds more like a confused driver which starts with one value and then
> overrides itself. I'd just silently set the value appropriate for the
> active backend. The preemption timeout kconfig text already documents
> the fact that timeouts can get overridden at runtime depending on
> platform and engine. So maybe just add the same text to the timeslice
> kconfig.
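>
> Something along these lines perhaps, assuming the text lives in
> Kconfig.profile next to the preemption timeout entry (the wording is
> only a sketch):
>
>         config DRM_I915_TIMESLICE_DURATION
>                 ...
>                   May also be adjusted per engine via sysfs and may be
>                   further limited at runtime depending on platform,
>                   engine and submission backend.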
>
>> +            }
>> +
>> +            if (engine->props.preempt_timeout_ms > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +                    drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %d to prevent possible overflow\n",
>> +                             GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>> +                    engine->props.preempt_timeout_ms = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +            }
>> +    }
>> +
>>      engine->defaults = engine->props; /* never to change again */
>>      engine->context_size = intel_engine_context_size(gt, engine->class);
>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> index 967031056202..f57efe026474 100644
>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> @@ -221,6 +221,13 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
>>      if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>              return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        duration > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
>> +            duration = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
>> +            drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %lld to prevent possible overflow\n",
>> +                     duration);
>> +    }
>
> I would suggest avoiding duplicated clamping logic. Maybe hide all the
> backend logic in helpers then, like maybe:
>
>         d = intel_engine_validate_timeslice/preempt_timeout(engine, duration);
>         if (d != duration)
>                 return -EINVAL;
>
> Returning -EINVAL would be equivalent to existing behaviour:
>
>         if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>                 return -EINVAL;
>
> That way userspace gets an explicit notification and read-back is
> identical to the written-in value. From engine setup you can just call
> the helper silently.
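>
> Rough sketch of the timeslice variant, untested and with a hypothetical
> name:
>
>         static u64
>         intel_engine_validate_timeslice(const struct intel_engine_cs *engine,
>                                         u64 duration_ms)
>         {
>                 /* Clamp to what the active backend can represent. */
>                 if (intel_engine_uses_guc(engine))
>                         duration_ms = min_t(u64, duration_ms,
>                                             GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>
>                 return duration_ms;
>         }
>
> Sysfs then rejects any value the helper changed, while engine setup
> just assigns whatever comes back.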
>
>> +
>>      WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>>      if (execlists_active(&engine->execlists))
>> @@ -325,6 +332,13 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
>>      if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>              return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        timeout > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +            timeout = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +            drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %lld to prevent possible overflow\n",
>> +                     timeout);
>> +    }
>> +
>>      WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>>      if (READ_ONCE(engine->execlists.pending[0]))
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index 6a4612a852e2..ad131092f8df 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>> #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>> +/*
>> + * GuC converts the timeout to clock ticks internally. Different platforms have
>> + * different GuC clocks. Thus, the maximum value before overflow is platform
>> + * dependent. Current worst case scenario is about 110s. So, limit to 100s to be
>> + * safe.
>> + */
>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS (100 * 1000)
>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS (100 * 1000)
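>
> For reference, the arithmetic checks out if the worst-case GuC clock is
> around 38.4 MHz, which is my assumption rather than something stated in
> the patch:
>
>         2^32 ticks / 38.4e6 ticks/s ~= 111.8 s
>
> which matches the ~110s above, so 100s leaves a little margin.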
>
> Most important question -
> how will we know/notice if/when a new GuC arrives where these timeouts
> would still overflow? Can this be queried somehow at runtime, or where
> does the limit come from? How is GuC told about it? Is it just set in
> some field which silently allows too large values to break things?
>
> Regards,
>
> Tvrtko
>
>> +
>>  struct guc_policies {
>>      u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>      /* In micro seconds. How much time to allow before DPC processing is