[Intel-gfx] [PATCH] drm/i915: Fix context ban and hang accounting for client
Chris Wilson
chris at chris-wilson.co.uk
Fri Jun 15 10:28:12 UTC 2018
Quoting Mika Kuoppala (2018-06-15 11:18:28)
> If client is smart or lucky enough to create a new context
> after each hang, our context banning mechanism will never
> catch up, and as a result of that it will be saved from
> client banning. This can result in a never ending streak of
> gpu hangs caused by bad or malicious client, preventing
> access from other legit gpu clients.
>
> Fix this by always incrementing per client ban score if
> it hangs in short successions regardless of context ban
> scoring. The exception are non bannable contexts. They remain
> detached from client ban scoring mechanism.
>
> v2: xchg timestamp, tidyup (Chris)
>
> Fixes: b083a0870c79 ("drm/i915: Add per client max context ban limit")
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 20 ++++++---
> drivers/gpu/drm/i915/i915_gem.c | 57 +++++++++++++++++--------
> drivers/gpu/drm/i915/i915_gem_context.c | 2 +-
> 3 files changed, 54 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 74dd88d8563e..93aa8e7dfaba 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -352,14 +352,20 @@ struct drm_i915_file_private {
>
> unsigned int bsd_engine;
>
> -/* Client can have a maximum of 3 contexts banned before
> - * it is denied of creating new contexts. As one context
> - * ban needs 4 consecutive hangs, and more if there is
> - * progress in between, this is a last resort stop gap measure
> - * to limit the badly behaving clients access to gpu.
> +/* Every context ban increments per client ban score. Also
/*
* Every
One day we'll have rewritten every line of code, and every comment.
> + * hangs in short succession increments ban score. If client suffers 3
> + * context bans, 9 hangs in quick succession or combination of those,
Leave out the numbers if possible, just explain the rationale of the
multilevel system.
> + * it is banned and submitting more work will fail. This is a stop gap
> + * measure to limit the badly behaving clients access to gpu.
> + * Note that unbannable contexts never increment the client ban score.
> */
> -#define I915_MAX_CLIENT_CONTEXT_BANS 3
> - atomic_t context_bans;
> +#define I915_CLIENT_SCORE_HANG_FAST 1
> +#define I915_CLIENT_FAST_HANG_JIFFIES (60 * HZ)
> +#define I915_CLIENT_SCORE_CONTEXT_BAN 3
> +#define I915_CLIENT_SCORE_BANNED 9
> + /** ban_score: Accumulated score of all ctx bans and fast hangs. */
> + atomic_t ban_score;
> + unsigned long hang_timestamp;
> };
>
> /* Interface history:
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 8dd4d35655af..f06fe1c636e5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2941,32 +2941,54 @@ i915_gem_object_pwrite_gtt(struct drm_i915_gem_object *obj,
> return 0;
> }
>
> +static void i915_gem_client_mark_guilty(struct drm_i915_file_private *file_priv,
> + const struct i915_gem_context *ctx)
> +{
> + unsigned int score;
> + unsigned long prev_hang;
> +
> + if (i915_gem_context_is_banned(ctx))
> + score = I915_CLIENT_SCORE_CONTEXT_BAN;
> + else
> + score = 0;
> +
> + prev_hang = xchg(&file_priv->hang_timestamp, jiffies);
> + if (time_before(jiffies, prev_hang + I915_CLIENT_FAST_HANG_JIFFIES))
> + score += I915_CLIENT_SCORE_HANG_FAST;
> +
> + if (score) {
> + atomic_add(score, &file_priv->ban_score);
> +
> + DRM_DEBUG_DRIVER("client %s: gained %u ban score, now %u\n",
> + ctx->name, score,
> + atomic_read(&file_priv->ban_score));
> + }
> +}
> +
> static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
> {
> - bool banned;
> + unsigned int score;
> + bool banned, bannable;
>
> atomic_inc(&ctx->guilty_count);
>
> - banned = false;
> - if (i915_gem_context_is_bannable(ctx)) {
> - unsigned int score;
> + bannable = i915_gem_context_is_bannable(ctx);
> + score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> + banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>
> - score = atomic_add_return(CONTEXT_SCORE_GUILTY,
> - &ctx->ban_score);
> - banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
> + DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, ban %s:%s\n",
> + ctx->name, atomic_read(&ctx->guilty_count),
> + score, yesno(bannable), yesno(banned));
ban: no:yes
Wut? Maybe just yesno(banned && bannable) as even debug messages
shouldn't strive to confuse us further.
Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
-Chris
More information about the Intel-gfx
mailing list