[Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
Daniel Vetter
daniel at ffwll.ch
Fri Mar 26 09:10:13 UTC 2021
On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>
> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> post of a somewhat controversial feature, now upgraded to patch status.
>
> I quote "watchdog" because, in the classical sense, a watchdog would allow
> userspace to ping it and so remain alive.
>
> I quote "restoring hangcheck" because this series, contrary to the old
> hangcheck, is not looking at whether the workload is making any progress from
> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> But apart from being prone to both false negatives and false positives I can't
> remember that myself.)
>
> Short version - ask is to fail any user submissions after a set time period. In
> this RFC that time is twelve seconds.
>
> Time counts from the moment user submission is "runnable" (implicit and explicit
> dependencies have been cleared) and keeps counting regardless of the GPU
> contention caused by other users of the system.
>
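As a rough illustration of the timing rule quoted above, here is a minimal sketch in C. It is not taken from the series: struct request_sketch, runnable_at_ms and request_expired() are hypothetical names, and the only behaviour it assumes is what the cover letter states, namely that the clock starts when the request becomes runnable and is not paused for GPU contention.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a request; not the i915 structure. */
struct request_sketch {
	/* Time at which all implicit and explicit dependencies were cleared. */
	uint64_t runnable_at_ms;
};

/*
 * Returns true once the request has been runnable for at least expiry_ms.
 * The elapsed time is wall-clock time since the request became runnable,
 * so time spent waiting behind other clients counts against it.
 */
static bool request_expired(const struct request_sketch *rq,
			    uint64_t now_ms, uint64_t expiry_ms)
{
	return now_ms - rq->runnable_at_ms >= expiry_ms;
}
```

With a twelve second expiry, a request that became runnable at t = 1000 ms would be failed at t = 13000 ms whether or not it ever reached the hardware.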
> So semantics are really a bit weak, but again, I understand this is really
> really wanted by the DRM core even if I am not convinced it is a good idea.
>
> There are some dangers with doing this - text borrowed from a patch in the
> series:
>
> This can have an effect that workloads which used to work fine will
> suddenly start failing. Even workloads comprised of short batches but in
> long dependency chains can be terminated.
>
> And because of the lack of agreement on usefulness and safety of fence error
> propagation this partial execution can be invisible to userspace even if
> it is "listening" to returned fence status.
>
> Another interaction is with hangcheck, where care needs to be taken that the
> timeout is not set lower than, or close to, three times the heartbeat interval.
> Otherwise a hang in any application can cause complete termination of all
> submissions from unrelated clients. Any users modifying the per engine
> heartbeat intervals therefore need to be aware of this potential denial of
> service to avoid inadvertently enabling it.
>
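The heartbeat constraint quoted above can be written down as a simple sanity check. This is only a sketch under stated assumptions: expiry_clear_of_heartbeat() and its margin_ms parameter are hypothetical and do not appear in the series, and the cover letter does not say how much headroom "close to" implies, so the margin is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the request expiry stays clear of three heartbeat
 * intervals by at least margin_ms. margin_ms is an assumed knob; the
 * quoted text only says the expiry must not be "lower or close to"
 * three times the heartbeat interval.
 */
static bool expiry_clear_of_heartbeat(uint64_t expiry_ms,
				      uint64_t heartbeat_ms,
				      uint64_t margin_ms)
{
	return expiry_ms >= 3 * heartbeat_ms + margin_ms;
}
```

For example, with a 2500 ms heartbeat interval and a 1000 ms margin, an expiry of 8000 ms would be rejected while 20000 ms would pass.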
> Given all this I am personally not convinced the scheme is a good idea.
> Intuitively it feels object importers would be better positioned to
> enforce the time they are willing to wait for something to complete.
>
> v2:
> * Dropped context param.
> * Improved commit messages and Kconfig text.
>
> v3:
> * Log timeouts.
> * Bump timeout to 20s to see if it helps Tigerlake.
I think 20s is a bit much, and it seems the problem is still there in igt. I
think we need to look at that and figure out what to do with it, and then
bring the timeout back down somewhat, since 20s is quite a long time. That
holds irrespective of all the additional gaps/opens around the watchdog timeout.
-Daniel
> * Fix sentinel assert.
>
> v4:
> * A round of review feedback applied.
>
> Chris Wilson (1):
> drm/i915: Individual request cancellation
>
> Tvrtko Ursulin (6):
> drm/i915: Extract active lookup engine to a helper
> drm/i915: Restrict sentinel requests further
> drm/i915: Handle async cancellation in sentinel assert
> drm/i915: Request watchdog infrastructure
> drm/i915: Fail too long user submissions by default
> drm/i915: Allow configuring default request expiry via modparam
>
> drivers/gpu/drm/i915/Kconfig.profile | 14 ++
> drivers/gpu/drm/i915/gem/i915_gem_context.c | 73 ++++---
> .../gpu/drm/i915/gem/i915_gem_context_types.h | 4 +
> drivers/gpu/drm/i915/gt/intel_context_param.h | 11 +-
> drivers/gpu/drm/i915/gt/intel_context_types.h | 4 +
> .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 1 +
> .../drm/i915/gt/intel_execlists_submission.c | 23 +-
> .../drm/i915/gt/intel_execlists_submission.h | 2 +
> drivers/gpu/drm/i915/gt/intel_gt.c | 3 +
> drivers/gpu/drm/i915/gt/intel_gt.h | 2 +
> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 28 +++
> drivers/gpu/drm/i915/gt/intel_gt_types.h | 7 +
> drivers/gpu/drm/i915/i915_params.c | 5 +
> drivers/gpu/drm/i915/i915_params.h | 1 +
> drivers/gpu/drm/i915/i915_request.c | 129 ++++++++++-
> drivers/gpu/drm/i915/i915_request.h | 16 +-
> drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
> 17 files changed, 479 insertions(+), 45 deletions(-)
>
> --
> 2.27.0
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch