[Intel-gfx] [PATCH i-g-t v2 2/2] tests/gem_reset_stats: Add client ban test

Chris Wilson chris at chris-wilson.co.uk
Fri Oct 13 21:23:43 UTC 2017


Quoting Antonio Argenziano (2017-10-13 21:49:29)
> A client that submits 'bad' contexts will be banned eventually while
> other clients are not affected. Add a test for this.
> 
> v2
>         - Do not use a fixed number of contexts to ban client (Chris)
> 
> Signed-off-by: Antonio Argenziano <antonio.argenziano at intel.com>
> 
> Cc: Michel Thierry <michel.thierry at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> ---
>  tests/gem_reset_stats.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
> index edc40767..da309237 100644
> --- a/tests/gem_reset_stats.c
> +++ b/tests/gem_reset_stats.c
> @@ -438,6 +438,61 @@ static void test_ban_ctx(const struct intel_execution_engine *e)
>         close(fd);
>  }
>  
> +static void test_client_ban(const struct intel_execution_engine *e)
> +{
> +       int fd_bad, fd_good;
> +       struct local_drm_i915_reset_stats rs_bad, rs_good;
> +       int ban, ctx_bans_retry = 10;
> +       int client_ban, client_bans_retry = 10;
> +       uint32_t ctx_bad;
> +       uint32_t test_ctx;
> +       int active_count = 0;

/*
 * We try to prevent DoS GPU hang attacks from one client affecting the
 * service of other clients by banning the malicious client.
 *
 * To simulate this, we have one client submit a sequence of GPU hangs,
 * after which we expect that client to be banned (no longer allowed
 * to submit new work). Meanwhile a second client should be unaffected
 * and allowed to continue.
 *
 * Note this is an extremely simplistic mechanism. For example, each
 * context inside an fd may represent a separate client (e.g. WebGL)
 * in which case the per-fd ban allows one client to DoS a second.
 * On the other hand, a malicious client may be creating a new context
 * for each hang to circumvent per-context restrictions. It may even be
 * creating a new fd each time to circumvent the per-fd restrictions.
 * Finally, the most malicious of all clients may succeed in wedging
 * the driver, bringing everyone to a standstill. In short, this DoS
 * prevention is itself suspect.
 *
 * So given the above, what exactly do we want to define as the
 * expected behaviour? The answer is none of the above: use a
 * timeslicing scheduler and no banning. Any context is then allowed
 * to spend as much of its timeslice spinning as it wants; it won't
 * be able to steal GPU time from other processes. DoS averted.
 */

I keep coming to the conclusion that banning is a temporary hack and
not mandatory uABI, much like hangcheck. The usual answer is that we
then describe the implementation's behaviour through, say, a
CONTEXT_GETPARAM and let userspace take that into account. But again,
I have to ask: what requirements does userspace have?
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_robustness.txt

(And how are the requirements going to change?)

The implication from that is that we could ban a context after the first
hang, but mesa isn't ready for that!
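
To make that concrete, something like the below (an untested sketch
against the existing I915_CONTEXT_PARAM_BANNABLE uAPI) is what I mean
by describing the implementation's behaviour through a GETPARAM;
whether this is a contract userspace can rely on is exactly the open
question:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

#include <i915_drm.h>

/* Ask the kernel whether it will ban this context after repeated
 * hangs. Returns 1 if bannable, 0 if not, negative errno on error. */
static int context_is_bannable(int fd, uint32_t ctx_id)
{
        struct drm_i915_gem_context_param p = {
                .ctx_id = ctx_id,
                .param = I915_CONTEXT_PARAM_BANNABLE,
        };

        if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &p))
                return -errno;

        return p.value != 0;
}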

Maybe you should try writing an explanation of exactly what behaviour
you are trying to elicit here and why? :)

One thing that seems nice on the surface for execlists is reprioritising
after a hang to allow all innocent parties to make progress at the
expense of the guilty. You will probably think of a few other approaches
that make sense as you think about what userspace expects.
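
For illustration, the userspace-facing knob for that could look like
the below. This is only a sketch: I915_CONTEXT_PARAM_PRIORITY is the
proposed (not yet merged) priority param, and the post-hang demotion
itself would live in the kernel, not in the test:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

#include <i915_drm.h>

/* Lower (or raise) the scheduling priority of a context; after a hang
 * the kernel could apply the same demotion to the guilty context so
 * that innocent contexts are serviced first. */
static int context_set_priority(int fd, uint32_t ctx_id, int prio)
{
        struct drm_i915_gem_context_param p = {
                .ctx_id = ctx_id,
                .param = I915_CONTEXT_PARAM_PRIORITY,
                .value = prio,
        };

        return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p) ? -errno : 0;
}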

> +       fd_bad = drm_open_driver(DRIVER_INTEL);
> +       fd_good = drm_open_driver(DRIVER_INTEL);
> +
> +       assert_reset_status(fd_bad, fd_bad, 0, RS_NO_ERROR);
> +       assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
> +
> +       while (client_bans_retry--) {
> +               client_ban = __gem_context_create(fd_bad, &ctx_bad);
> +               if (client_ban == -EIO)
> +                       break;
> +
> +               noop(fd_bad, ctx_bad, e);
> +               assert_reset_status(fd_bad, fd_bad, ctx_bad, RS_NO_ERROR);
> +
> +               ctx_bans_retry = 10;
> +               active_count = 0;
> +               while (ctx_bans_retry--) {
> +                       inject_hang(fd_bad, ctx_bad, e, BAN);
> +                       active_count++;
> +
> +                       ban = noop(fd_bad, ctx_bad, e);
> +                       if (ban == -EIO)
> +                               break;

There's a major variation in submitting a queue of hangs that also needs
to be taken into account (with different code paths in the kernel).
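
An untested sketch of that variation, reusing this patch's helpers (so
not standalone); the ASYNC flag is hypothetical and stands for "submit
without waiting for each hang to be reaped":

        /* Queue a burst of hangs on the bad context without waiting
         * in between, so the kernel has to cope with a backlog of
         * pending hangs rather than one fully-processed hang at a
         * time. */
        for (int i = 0; i < 10; i++)
                inject_hang(fd_bad, ctx_bad, e, ASYNC);

        /* The ban should be reported once the queue is reaped, while
         * the good client remains unaffected. */
        igt_assert_eq(noop(fd_bad, ctx_bad, e), -EIO);
        assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
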
-Chris

