[Bug 106680] [CI][SHARDS] igt at gem_exec_await@wide-contexts - fail/dmesg-fail - Failed assertion: !"GPU hung"
bugzilla-daemon at freedesktop.org
Thu Oct 11 16:37:38 UTC 2018
https://bugs.freedesktop.org/show_bug.cgi?id=106680
Martin Peres <martin.peres at free.fr> changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
  i915 platform    |            |CNL, CFL
     Resolution    |FIXED       |---
         Status    |RESOLVED    |REOPENED
--- Comment #8 from Martin Peres <martin.peres at free.fr> ---
(In reply to Chris Wilson from comment #7)
> commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
> Author: Chris Wilson <chris at chris-wilson.co.uk>
> Date: Fri Sep 14 09:00:15 2018 +0100
>
> drm/i915: Limit the backpressure for i915_request allocation
>
> If we try and fail to allocate an i915_request, we apply some
> backpressure on the clients to throttle the memory allocations coming
> from i915.ko. Currently, we wait until completely idle, but this is far
> too heavy and leads to some situations where the only escape is to
> declare a client hung and reset the GPU. The intent is to only ratelimit
> the allocation requests and to allow ourselves to recycle requests and
> memory from any long queues built up by a client hog.
>
> Although system memory is inherently a global resource, we don't
> want to overly penalize an unlucky client with the price of reaping a
> hog. To reduce the influence of one client on another, we can, instead
> of waiting for the entire GPU to idle, impose a barrier on the local
> client.
> (One end goal for request allocation is for scalability to many
> concurrent allocators; simultaneous execbufs.)
>
> To prevent ourselves from getting caught out by long running requests
> (requests that may never finish without userspace intervention, whom we
> are blocking) we need to impose a finite timeout, ideally shorter than
> hangcheck. A long time ago Paul McKenney suggested that RCU users should
> ratelimit themselves using judicious use of cond_synchronize_rcu(). This
> gives us the opportunity to reduce our indefinite wait for the GPU to
> idle to a wait for the RCU grace period of the previous allocation along
> this timeline to expire, satisfying both the local and finite properties
> we desire for our ratelimiting.
>
> There are still a few global steps (reclaim not least amongst those!)
> when we exhaust the immediate slab pool, but at least now the wait is
> itself decoupled from struct_mutex for our glorious highly parallel future!
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris at chris-wilson.co.uk
>
> Pretty confident!
Unfortunately, this is still not fixed, as it is happening at the same rate as
before:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4948/shard-kbl2/igt@gem_exec_await@wide-contexts.html
Starting subtest: wide-contexts
(gem_exec_await:1079) igt_aux-CRITICAL: Test assertion failure function
sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1079) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
And we have also seen issues on WHL and CNL:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_124/fi-cnl-u/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_121/fi-whl-u/igt@gem_exec_await@wide-contexts.html
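For reference, the ratelimiting scheme described in the quoted commit can be sketched as follows. This is a hypothetical kernel-style sketch, not the actual i915 patch: the struct and function names (timeline, request_alloc, request_cache) are illustrative, and only get_state_synchronize_rcu()/cond_synchronize_rcu() are real kernel RCU APIs. Instead of waiting for the whole GPU to idle on allocation failure, each timeline remembers an RCU grace-period cookie from its previous allocation and waits only for that grace period to expire:

    /*
     * Hypothetical sketch of per-timeline backpressure via RCU cookies.
     * Names other than the RCU APIs are invented for illustration.
     */
    struct timeline {
    	unsigned long last_rcu_cookie;	/* from get_state_synchronize_rcu() */
    };

    static struct i915_request *request_alloc(struct timeline *tl)
    {
    	struct i915_request *rq;

    	rq = kmem_cache_alloc(request_cache, GFP_KERNEL | __GFP_NOWARN);
    	if (!rq) {
    		/*
    		 * Backpressure: wait only until the grace period that
    		 * covered the previous allocation on this timeline has
    		 * elapsed. The wait is local (per client/timeline) and
    		 * finite (one grace period), unlike idling the whole GPU.
    		 */
    		cond_synchronize_rcu(tl->last_rcu_cookie);
    		rq = kmem_cache_alloc(request_cache, GFP_KERNEL);
    		if (!rq)
    			return NULL;
    	}
    	tl->last_rcu_cookie = get_state_synchronize_rcu();
    	return rq;
    }

The point of the cookie is that cond_synchronize_rcu() returns immediately if the recorded grace period has already elapsed, so a well-behaved client pays nothing, while a hog that allocates faster than grace periods expire is throttled.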
--
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.