[Intel-gfx] [PATCH 13/20] drm/i915/guc: Relax CTB response timeout

Daniel Vetter daniel at ffwll.ch
Fri Jun 4 08:33:07 UTC 2021


On Wed, Jun 02, 2021 at 10:16:23PM -0700, Matthew Brost wrote:
> From: Michal Wajdeczko <michal.wajdeczko at intel.com>
> 
> In upcoming patch we will allow more CTB requests to be sent in
> parallel to the GuC for processing, so we shouldn't assume any more
> that GuC will always reply within 10ms.
> 
> Use bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.
> 
> v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option
> 
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> Reviewed-by: Matthew Brost <matthew.brost at intel.com>

So this is a rant, but for upstream we really need to do better than
internal:

- The driver must work by default in the optimal configuration.

- Any config change that we haven't validated _must_ taint the kernel
  (this is especially for module options, but also for config settings)

- Config options need a real reason beyond "was useful for bring-up".

Our internal tree is an absolute disaster right now, with multi-line
kernel configs (different on each platform) and bespoke kernel config or
the driver just fails. We're the experts on our own hw, we should know how
it works, not offload that to users essentially asking them "how shitty do
you think Intel hw is in responding timely".

Yes I know there's a lot of these there already, they don't make a lot of
sense either.

Unless there's a real reason for this (aside from us just offloading
testing to our users instead of doing it ourselves properly) I think we
should hardcode this, with a comment explaining why. Maybe with a switch
between the PF/VF case once that's landed.
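To illustrate what hardcoding with a comment and a PF/VF switch could
look like, here is a minimal sketch. It is not i915 code: the macro and
function names are hypothetical, the PF value simply reuses the 1500 ms
default from the quoted Kconfig entry, and the VF value is an unvalidated
placeholder.

```c
#include <stdbool.h>

/*
 * Hypothetical hardcoded CTB response timeouts (ms).
 *
 * With many CTB requests in flight the GuC may take much longer than
 * 10 ms to reply, so the wait must cover a busy queue. 1500 ms is the
 * default proposed in the quoted patch; the VF value is an assumption
 * for illustration only and has not been validated on any hardware.
 */
#define GUC_CTB_RESPONSE_TIMEOUT_PF_MS 1500
#define GUC_CTB_RESPONSE_TIMEOUT_VF_MS 3000 /* placeholder, assumption */

/* Pick the timeout for the current mode instead of exposing a knob. */
static inline long guc_ctb_response_timeout_ms(bool is_vf)
{
	return is_vf ? GUC_CTB_RESPONSE_TIMEOUT_VF_MS
		     : GUC_CTB_RESPONSE_TIMEOUT_PF_MS;
}
```

The point of this shape is that the reasoning lives next to the number in
the source, and nothing is left for the user to tune.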

> ---
>  drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 39328567c200..0d5475b5f28a 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -38,6 +38,16 @@ config DRM_I915_USERFAULT_AUTOSUSPEND
>  	  May be 0 to disable the extra delay and solely use the device level
>  	  runtime pm autosuspend delay tunable.
>  
> +config DRM_I915_GUC_CTB_TIMEOUT
> +	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
> +	default 1500 # milliseconds
> +	range 10 60000

Also range is definitely off, drm/scheduler will probably nuke you
beforehand :-)

That's kinda another issue I have with all these kconfig knobs: Maybe we
need a knob for "relax with reset attempts, my workloads overload my gpus
routinely", which then scales _all_ timeouts proportionally. But letting
the user set them all, with silly combinations like resetting the
workload before heartbeat or stuff like that doesn't make much sense.
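A rough sketch of the single-knob idea, assuming one percentage that
scales every timeout together so their relative ordering cannot be
broken. The struct, function names and the clamp at 100% are
hypothetical; the base values are the defaults visible in the quoted
Kconfig.profile (1500 ms CTB, 2500 ms heartbeat).

```c
#include <stdint.h>

/* Hypothetical bundle of timeouts that must stay in proportion. */
struct gpu_timeouts_ms {
	uint32_t ctb_response;
	uint32_t heartbeat_interval;
};

/* Base values taken from the defaults in the quoted Kconfig.profile. */
static const struct gpu_timeouts_ms base_timeouts = {
	.ctb_response = 1500,
	.heartbeat_interval = 2500,
};

/* Scale one value by a percentage, saturating instead of overflowing. */
static uint32_t scale_ms(uint32_t base_ms, uint32_t scale_percent)
{
	uint64_t scaled = (uint64_t)base_ms * scale_percent / 100;

	return scaled > UINT32_MAX ? UINT32_MAX : (uint32_t)scaled;
}

/*
 * Apply one "relax" knob to all timeouts at once. Values below 100%
 * are clamped (an assumption of this sketch) so the knob can only
 * relax the validated defaults, never tighten them.
 */
static struct gpu_timeouts_ms scaled_timeouts(uint32_t scale_percent)
{
	struct gpu_timeouts_ms t;

	if (scale_percent < 100)
		scale_percent = 100;

	t.ctb_response = scale_ms(base_timeouts.ctb_response, scale_percent);
	t.heartbeat_interval = scale_ms(base_timeouts.heartbeat_interval,
					scale_percent);
	return t;
}
```

For example, scaled_timeouts(200) doubles both values, so the CTB wait
still expires before the heartbeat interval just as it does by default.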

Anyway, tiny patch so hopefully I can leave this one out for now until
we've closed this.
-Daniel

> +	help
> +	  Configures the default timeout waiting for the GuC to make forward
> +	  progress on CTBs. e.g. Waiting for a response to a request.
> +
> +	  A range of 10 ms to 60000 ms is allowed.
> +
>  config DRM_I915_HEARTBEAT_INTERVAL
>  	int "Interval between heartbeat pulses (ms)"
>  	default 2500 # milliseconds
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 916c2b80c841..cf1fb09ef766 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -436,6 +436,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   */
>  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  {
> +	long timeout;
>  	int err;
>  
>  	/*
> @@ -443,10 +444,12 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  	 * up to that length of time, then switch to a slower sleep-wait loop.
>  	 * No GuC command should ever take longer than 10ms.
>  	 */
> +	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
> +
>  #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
>  	err = wait_for_us(done, 10);
>  	if (err)
> -		err = wait_for(done, 10);
> +		err = wait_for(done, timeout);
>  #undef done
>  
>  	if (unlikely(err))
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
