[Intel-gfx] [PATCH] drm/i915: Fix context ban and hang accounting for client

Mika Kuoppala mika.kuoppala at linux.intel.com
Fri Jun 15 13:41:15 UTC 2018


Chris Wilson <chris at chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-06-15 11:18:28)
>> If client is smart or lucky enough to create a new context
>> after each hang, our context banning mechanism will never
>> catch up, and as a result of that it will be saved from
>> client banning. This can result in a never ending streak of
>> gpu hangs caused by bad or malicious client, preventing
>> access from other legit gpu clients.
>> 
>> Fix this by always incrementing per client ban score if
>> it hangs in short successions regardless of context ban
>> scoring. The exception are non bannable contexts. They remain
>> detached from client ban scoring mechanism.
>> 
>> v2: xchg timestamp, tidyup (Chris)
>> 
>> Fixes: b083a0870c79 ("drm/i915: Add per client max context ban limit")
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.h         | 20 ++++++---
>>  drivers/gpu/drm/i915/i915_gem.c         | 57 +++++++++++++++++--------
>>  drivers/gpu/drm/i915/i915_gem_context.c |  2 +-
>>  3 files changed, 54 insertions(+), 25 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 74dd88d8563e..93aa8e7dfaba 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -352,14 +352,20 @@ struct drm_i915_file_private {
>>  
>>         unsigned int bsd_engine;
>>  
>> -/* Client can have a maximum of 3 contexts banned before
>> - * it is denied of creating new contexts. As one context
>> - * ban needs 4 consecutive hangs, and more if there is
>> - * progress in between, this is a last resort stop gap measure
>> - * to limit the badly behaving clients access to gpu.
>> +/* Every context ban increments per client ban score. Also
>
> /*
>  * Every
>
> One day we'll have rewritten every line of code, and every comment.
>
>> + * hangs in short succession increments ban score. If client suffers 3
>> + * context bans, 9 hangs in quick succession or combination of those,
>
> Leave out the numbers if possible, just explain the rationale of the
> multilevel system.
>
>> + * it is banned and submitting more work will fail. This is a stop gap
>> + * measure to limit the badly behaving clients access to gpu.
>> + * Note that unbannable contexts never increment the client ban score.
>>   */
>> -#define I915_MAX_CLIENT_CONTEXT_BANS 3
>> -       atomic_t context_bans;
>> +#define I915_CLIENT_SCORE_HANG_FAST    1
>> +#define   I915_CLIENT_FAST_HANG_JIFFIES (60 * HZ)
>> +#define I915_CLIENT_SCORE_CONTEXT_BAN   3
>> +#define I915_CLIENT_SCORE_BANNED       9
>> +       /** ban_score: Accumulated score of all ctx bans and fast hangs. */
>> +       atomic_t ban_score;
>> +       unsigned long hang_timestamp;
>>  };
>>  
>>  /* Interface history:
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 8dd4d35655af..f06fe1c636e5 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2941,32 +2941,54 @@ i915_gem_object_pwrite_gtt(struct drm_i915_gem_object *obj,
>>         return 0;
>>  }
>>  
>> +static void i915_gem_client_mark_guilty(struct drm_i915_file_private *file_priv,
>> +                                       const struct i915_gem_context *ctx)
>> +{
>> +       unsigned int score;
>> +       unsigned long prev_hang;
>> +
>> +       if (i915_gem_context_is_banned(ctx))
>> +               score = I915_CLIENT_SCORE_CONTEXT_BAN;
>> +       else
>> +               score = 0;
>> +
>> +       prev_hang = xchg(&file_priv->hang_timestamp, jiffies);
>> +       if (time_before(jiffies, prev_hang + I915_CLIENT_FAST_HANG_JIFFIES))
>> +               score += I915_CLIENT_SCORE_HANG_FAST;
>> +
>> +       if (score) {
>> +               atomic_add(score, &file_priv->ban_score);
>> +
>> +               DRM_DEBUG_DRIVER("client %s: gained %u ban score, now %u\n",
>> +                                ctx->name, score,
>> +                                atomic_read(&file_priv->ban_score));
>> +       }
>> +}
>> +
>>  static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
>>  {
>> -       bool banned;
>> +       unsigned int score;
>> +       bool banned, bannable;
>>  
>>         atomic_inc(&ctx->guilty_count);
>>  
>> -       banned = false;
>> -       if (i915_gem_context_is_bannable(ctx)) {
>> -               unsigned int score;
>> +       bannable = i915_gem_context_is_bannable(ctx);
>> +       score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
>> +       banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>>  
>> -               score = atomic_add_return(CONTEXT_SCORE_GUILTY,
>> -                                         &ctx->ban_score);
>> -               banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>> +       DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, ban %s:%s\n",
>> +                        ctx->name, atomic_read(&ctx->guilty_count),
>> +                        score, yesno(bannable), yesno(banned));
>
> ban: no:yes
>
> Wut? Maybe just yesno(banned && bannable) as even debug messages
> shouldn't strive to confuse us further.
>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

Fixed the above and pushed. Thanks for review!
-Mika


More information about the Intel-gfx mailing list