[Intel-gfx] [PATCH i-g-t v2 2/2] tests/gem_reset_stats: Add client ban test
Chris Wilson
chris at chris-wilson.co.uk
Fri Oct 13 21:23:43 UTC 2017
Quoting Antonio Argenziano (2017-10-13 21:49:29)
> A client that submits 'bad' contexts will be banned eventually while
> other clients are not affected. Add a test for this.
>
> v2
> - Do not use a fixed number of contexts to ban client (Chris)
>
> Signed-off-by: Antonio Argenziano <antonio.argenziano at intel.com>
>
> Cc: Michel Thierry <michel.thierry at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> ---
> tests/gem_reset_stats.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
> index edc40767..da309237 100644
> --- a/tests/gem_reset_stats.c
> +++ b/tests/gem_reset_stats.c
> @@ -438,6 +438,61 @@ static void test_ban_ctx(const struct intel_execution_engine *e)
> close(fd);
> }
>
> +static void test_client_ban(const struct intel_execution_engine *e)
> +{
> +     int fd_bad, fd_good;
> + struct local_drm_i915_reset_stats rs_bad, rs_good;
> + int ban, ctx_bans_retry = 10;
> + int client_ban, client_bans_retry = 10;
> + uint32_t ctx_bad;
> + uint32_t test_ctx;
> + int active_count = 0;
/*
* We try to prevent DoS GPU hang attacks from one client affecting the
* service of other clients by banning the malicious client.
*
* To simulate this, we have one client submit a sequence of GPU hangs
* that we expect to be banned (the client will no longer be allowed
* to submit new work). Meanwhile a second client will not be affected
* and allowed to continue.
*
* Note this is an extremely simplistic mechanism. For example, each
* context inside an fd may represent a separate client (e.g. WebGL)
 * in which case the per-fd ban allows one client to DoS a second.
* On the other hand, a malicious client may be creating a new context
* for each hang to circumvent per-context restrictions. It may even be
* creating a new fd each time to circumvent the per-fd restrictions.
* Finally, the most malicious of all clients may succeed in wedging
* the driver, bringing everyone to a standstill. In short, this DoS
* prevention is itself suspect.
*
 * So given the above, what exactly do we want to define as the
 * expected behaviour? The answer is none of the above; use a
 * timeslicing scheduler and no banning. Any context is then allowed to
 * spend as much of its timeslice spinning as it wants; it won't be
* allowed to steal GPU time from other processes. DoS averted.
*/
I keep coming to the conclusion that banning is a temporary hack, and
not mandatory uABI, like hangcheck. The usual answer is that we then
describe the implementation's behaviour through say a CONTEXT_GETPARAM
and let userspace factor that in. But again, I have to ask:
what requirements does userspace actually have?
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_robustness.txt
(And how are the requirements going to change?)
The implication from that is that we could ban a context after the first
hang, but mesa isn't ready for that!
Maybe you should try writing an explanation of exactly what behaviour
you are trying to elicit here and why? :)
One thing that seems nice on the surface for execlists is reprioritising
after a hang to allow all innocent parties to make progress at the
expense of the guilty. You will probably think of a few other approaches
that make sense as you think about what userspace expects.
> + fd_bad = drm_open_driver(DRIVER_INTEL);
> + fd_good = drm_open_driver(DRIVER_INTEL);
> +
> + assert_reset_status(fd_bad, fd_bad, 0, RS_NO_ERROR);
> + assert_reset_status(fd_good, fd_good, 0, RS_NO_ERROR);
> +
> + while (client_bans_retry--) {
> + client_ban = __gem_context_create(fd_bad, &ctx_bad);
> + if (client_ban == -EIO)
> + break;
> +
> + noop(fd_bad, ctx_bad, e);
> + assert_reset_status(fd_bad, fd_bad, ctx_bad, RS_NO_ERROR);
> +
> + ctx_bans_retry = 10;
> + active_count = 0;
> + while (ctx_bans_retry--) {
> + inject_hang(fd_bad, ctx_bad, e, BAN);
> + active_count++;
> +
> + ban = noop(fd_bad, ctx_bad, e);
> + if (ban == -EIO)
> + break;
There's a major variation in submitting a queue of hangs that also needs
to be taken into account (with different code paths in the kernel).
-Chris