[Intel-gfx] [PATCH] drm/i915: Fix context ban and hang accounting for client

Chris Wilson chris at chris-wilson.co.uk
Fri Jun 15 10:28:12 UTC 2018


Quoting Mika Kuoppala (2018-06-15 11:18:28)
> If client is smart or lucky enough to create a new context
> after each hang, our context banning mechanism will never
> catch up, and as a result of that it will be saved from
> client banning. This can result in a never ending streak of
> gpu hangs caused by bad or malicious client, preventing
> access from other legit gpu clients.
> 
> Fix this by always incrementing per client ban score if
> it hangs in short successions regardless of context ban
> scoring. The exception are non bannable contexts. They remain
> detached from client ban scoring mechanism.
> 
> v2: xchg timestamp, tidyup (Chris)
> 
> Fixes: b083a0870c79 ("drm/i915: Add per client max context ban limit")
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         | 20 ++++++---
>  drivers/gpu/drm/i915/i915_gem.c         | 57 +++++++++++++++++--------
>  drivers/gpu/drm/i915/i915_gem_context.c |  2 +-
>  3 files changed, 54 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 74dd88d8563e..93aa8e7dfaba 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -352,14 +352,20 @@ struct drm_i915_file_private {
>  
>         unsigned int bsd_engine;
>  
> -/* Client can have a maximum of 3 contexts banned before
> - * it is denied of creating new contexts. As one context
> - * ban needs 4 consecutive hangs, and more if there is
> - * progress in between, this is a last resort stop gap measure
> - * to limit the badly behaving clients access to gpu.
> +/* Every context ban increments per client ban score. Also

/*
 * Every

One day we'll have rewritten every line of code, and every comment.

> + * hangs in short succession increments ban score. If client suffers 3
> + * context bans, 9 hangs in quick succession or combination of those,

Leave out the numbers if possible, just explain the rationale of the
multilevel system.

> + * it is banned and submitting more work will fail. This is a stop gap
> + * measure to limit the badly behaving clients access to gpu.
> + * Note that unbannable contexts never increment the client ban score.
>   */
> -#define I915_MAX_CLIENT_CONTEXT_BANS 3
> -       atomic_t context_bans;
> +#define I915_CLIENT_SCORE_HANG_FAST    1
> +#define   I915_CLIENT_FAST_HANG_JIFFIES (60 * HZ)
> +#define I915_CLIENT_SCORE_CONTEXT_BAN   3
> +#define I915_CLIENT_SCORE_BANNED       9
> +       /** ban_score: Accumulated score of all ctx bans and fast hangs. */
> +       atomic_t ban_score;
> +       unsigned long hang_timestamp;
>  };
>  
>  /* Interface history:
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 8dd4d35655af..f06fe1c636e5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2941,32 +2941,54 @@ i915_gem_object_pwrite_gtt(struct drm_i915_gem_object *obj,
>         return 0;
>  }
>  
> +static void i915_gem_client_mark_guilty(struct drm_i915_file_private *file_priv,
> +                                       const struct i915_gem_context *ctx)
> +{
> +       unsigned int score;
> +       unsigned long prev_hang;
> +
> +       if (i915_gem_context_is_banned(ctx))
> +               score = I915_CLIENT_SCORE_CONTEXT_BAN;
> +       else
> +               score = 0;
> +
> +       prev_hang = xchg(&file_priv->hang_timestamp, jiffies);
> +       if (time_before(jiffies, prev_hang + I915_CLIENT_FAST_HANG_JIFFIES))
> +               score += I915_CLIENT_SCORE_HANG_FAST;
> +
> +       if (score) {
> +               atomic_add(score, &file_priv->ban_score);
> +
> +               DRM_DEBUG_DRIVER("client %s: gained %u ban score, now %u\n",
> +                                ctx->name, score,
> +                                atomic_read(&file_priv->ban_score));
> +       }
> +}
> +
>  static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
>  {
> -       bool banned;
> +       unsigned int score;
> +       bool banned, bannable;
>  
>         atomic_inc(&ctx->guilty_count);
>  
> -       banned = false;
> -       if (i915_gem_context_is_bannable(ctx)) {
> -               unsigned int score;
> +       bannable = i915_gem_context_is_bannable(ctx);
> +       score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> +       banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>  
> -               score = atomic_add_return(CONTEXT_SCORE_GUILTY,
> -                                         &ctx->ban_score);
> -               banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
> +       DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, ban %s:%s\n",
> +                        ctx->name, atomic_read(&ctx->guilty_count),
> +                        score, yesno(bannable), yesno(banned));

ban: no:yes

Wut? Maybe just yesno(banned && bannable) as even debug messages
shouldn't strive to confuse us further.

Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
-Chris


More information about the Intel-gfx mailing list