[Intel-gfx] [PATCH] drm/i915: Fix context ban and hang accounting for client
Mika Kuoppala
mika.kuoppala at linux.intel.com
Fri Jun 15 13:41:15 UTC 2018
Chris Wilson <chris at chris-wilson.co.uk> writes:
> Quoting Mika Kuoppala (2018-06-15 11:18:28)
>> If client is smart or lucky enough to create a new context
>> after each hang, our context banning mechanism will never
>> catch up, and as a result of that it will be saved from
>> client banning. This can result in a never ending streak of
>> gpu hangs caused by bad or malicious client, preventing
>> access from other legit gpu clients.
>>
>> Fix this by always incrementing per client ban score if
>> it hangs in short successions regardless of context ban
>> scoring. The exception are non bannable contexts. They remain
>> detached from client ban scoring mechanism.
>>
>> v2: xchg timestamp, tidyup (Chris)
>>
>> Fixes: b083a0870c79 ("drm/i915: Add per client max context ban limit")
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 20 ++++++---
>> drivers/gpu/drm/i915/i915_gem.c | 57 +++++++++++++++++--------
>> drivers/gpu/drm/i915/i915_gem_context.c | 2 +-
>> 3 files changed, 54 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 74dd88d8563e..93aa8e7dfaba 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -352,14 +352,20 @@ struct drm_i915_file_private {
>>
>> unsigned int bsd_engine;
>>
>> -/* Client can have a maximum of 3 contexts banned before
>> - * it is denied of creating new contexts. As one context
>> - * ban needs 4 consecutive hangs, and more if there is
>> - * progress in between, this is a last resort stop gap measure
>> - * to limit the badly behaving clients access to gpu.
>> +/* Every context ban increments per client ban score. Also
>
> /*
> * Every
>
> One day we'll have rewritten every line of code, and every comment.
>
>> + * hangs in short succession increments ban score. If client suffers 3
>> + * context bans, 9 hangs in quick succession or combination of those,
>
> Leave out the numbers if possible, just explain the rationale of the
> multilevel system.
>
>> + * it is banned and submitting more work will fail. This is a stop gap
>> + * measure to limit the badly behaving clients access to gpu.
>> + * Note that unbannable contexts never increment the client ban score.
>> */
>> -#define I915_MAX_CLIENT_CONTEXT_BANS 3
>> - atomic_t context_bans;
>> +#define I915_CLIENT_SCORE_HANG_FAST 1
>> +#define I915_CLIENT_FAST_HANG_JIFFIES (60 * HZ)
>> +#define I915_CLIENT_SCORE_CONTEXT_BAN 3
>> +#define I915_CLIENT_SCORE_BANNED 9
>> + /** ban_score: Accumulated score of all ctx bans and fast hangs. */
>> + atomic_t ban_score;
>> + unsigned long hang_timestamp;
>> };
>>
>> /* Interface history:
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 8dd4d35655af..f06fe1c636e5 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2941,32 +2941,54 @@ i915_gem_object_pwrite_gtt(struct drm_i915_gem_object *obj,
>> return 0;
>> }
>>
>> +static void i915_gem_client_mark_guilty(struct drm_i915_file_private *file_priv,
>> + const struct i915_gem_context *ctx)
>> +{
>> + unsigned int score;
>> + unsigned long prev_hang;
>> +
>> + if (i915_gem_context_is_banned(ctx))
>> + score = I915_CLIENT_SCORE_CONTEXT_BAN;
>> + else
>> + score = 0;
>> +
>> + prev_hang = xchg(&file_priv->hang_timestamp, jiffies);
>> + if (time_before(jiffies, prev_hang + I915_CLIENT_FAST_HANG_JIFFIES))
>> + score += I915_CLIENT_SCORE_HANG_FAST;
>> +
>> + if (score) {
>> + atomic_add(score, &file_priv->ban_score);
>> +
>> + DRM_DEBUG_DRIVER("client %s: gained %u ban score, now %u\n",
>> + ctx->name, score,
>> + atomic_read(&file_priv->ban_score));
>> + }
>> +}
>> +
>> static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
>> {
>> - bool banned;
>> + unsigned int score;
>> + bool banned, bannable;
>>
>> atomic_inc(&ctx->guilty_count);
>>
>> - banned = false;
>> - if (i915_gem_context_is_bannable(ctx)) {
>> - unsigned int score;
>> + bannable = i915_gem_context_is_bannable(ctx);
>> + score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
>> + banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>>
>> - score = atomic_add_return(CONTEXT_SCORE_GUILTY,
>> - &ctx->ban_score);
>> - banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
>> + DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, ban %s:%s\n",
>> + ctx->name, atomic_read(&ctx->guilty_count),
>> + score, yesno(bannable), yesno(banned));
>
> ban: no:yes
>
> Wut? Maybe just yesno(banned && bannable) as even debug messages
> shouldn't strive to confuse us further.
>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
Fixed the above and pushed. Thanks for review!
-Mika
More information about the Intel-gfx
mailing list