[Intel-gfx] [CI 1/5] drm/i915/gt: Try to more gracefully quiesce the system before resets
Ramalingam C
ramalingam.c at intel.com
Thu Oct 24 01:32:01 UTC 2019
On 2019-10-23 at 14:31:04 +0100, Chris Wilson wrote:
> If we are doing a normal GPU reset triggered after detecting a long
> period of stalled work, we can take our time and allow the engines to
> quiesce. Since we've stopped submission to the engine, and if we wait
> long enough an innocent context should complete, leaving the engine idle.
> So by waiting a short amount of time, we should prevent clobbering other
> users when resetting a stuck context.
>
> Suggested-by: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Suggested-by: Jon Bloomfield <jon.bloomfield at intel.com>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> ---
> drivers/gpu/drm/i915/Kconfig.profile | 11 +++++++++++
> drivers/gpu/drm/i915/gt/intel_engine_cs.c | 20 +++++++++++++++++++-
> drivers/gpu/drm/i915/gt/intel_engine_types.h | 4 ++++
> 3 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 48df8889a88a..3a3881d5e44b 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
> May be 0 to disable the initial spin. In practice, we estimate
> the cost of enabling the interrupt (if currently disabled) to be
> a few microseconds.
> +
> +config DRM_I915_STOP_TIMEOUT
> + int "How long to wait for an engine to quiesce gracefully before reset (ms)"
> + default 100 # milliseconds
> + help
> + By stopping submission and sleeping for a short time before resetting
> + the GPU, we allow the innocent contexts also on the system to quiesce.
> + It is then less likely for a hanging context to cause collateral
> + damage as the system is reset in order to recover. The corollary is
> + that the reset itself may take longer and so be more disruptive to
> + interactive or low latency workloads.
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 0e20713603ec..e4203eb44139 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -308,6 +308,9 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
> engine->instance = info->instance;
> __sprint_engine_name(engine);
>
> + engine->props.stop_timeout_ms =
> + CONFIG_DRM_I915_STOP_TIMEOUT;
Compare to previous version where you used the CONFIG variable directly,
what is the benefit of using it through a variable? So that we could
alter it in runtime?
-Ram
> +
> /*
> * To be overridden by the backend on setup. However to facilitate
> * cleanup on error during setup, we always provide the destroy vfunc.
> @@ -875,6 +878,21 @@ u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine)
> return bbaddr;
> }
>
> +static unsigned long stop_timeout(const struct intel_engine_cs *engine)
> +{
> + if (in_atomic() || irqs_disabled()) /* inside atomic preempt-reset? */
> + return 0;
> +
> + /*
> + * If we are doing a normal GPU reset, we can take our time and allow
> + * the engine to quiesce. We've stopped submission to the engine, and
> + * if we wait long enough an innocent context should complete and
> + * leave the engine idle. So they should not be caught unaware by
> + * the forthcoming GPU reset (which usually follows the stop_cs)!
> + */
> + return READ_ONCE(engine->props.stop_timeout_ms);
> +}
> +
> int intel_engine_stop_cs(struct intel_engine_cs *engine)
> {
> struct intel_uncore *uncore = engine->uncore;
> @@ -892,7 +910,7 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine)
> err = 0;
> if (__intel_wait_for_register_fw(uncore,
> mode, MODE_IDLE, MODE_IDLE,
> - 1000, 0,
> + 1000, stop_timeout(engine),
> NULL)) {
> GEM_TRACE("%s: timed out on STOP_RING -> IDLE\n", engine->name);
> err = -ETIMEDOUT;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 3451be034caf..87d5c4ef3ae7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -542,6 +542,10 @@ struct intel_engine_cs {
> */
> ktime_t total;
> } stats;
> +
> + struct {
> + unsigned long stop_timeout_ms;
> + } props;
> };
>
> static inline bool
> --
> 2.24.0.rc0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
More information about the Intel-gfx
mailing list