[Bug 106680] [CI][SHARDS] igt at gem_exec_await@wide-contexts - fail/dmesg-fail - Failed assertion: !"GPU hung"
bugzilla-daemon at freedesktop.org
Thu Oct 11 16:37:38 UTC 2018
https://bugs.freedesktop.org/show_bug.cgi?id=106680
Martin Peres <martin.peres at free.fr> changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
  i915 platform    |            |CNL, CFL
     Resolution    |FIXED       |---
         Status    |RESOLVED    |REOPENED
--- Comment #8 from Martin Peres <martin.peres at free.fr> ---
(In reply to Chris Wilson from comment #7)
> commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
> Author: Chris Wilson <chris at chris-wilson.co.uk>
> Date: Fri Sep 14 09:00:15 2018 +0100
>
> drm/i915: Limit the backpressure for i915_request allocation
>
> If we try and fail to allocate an i915_request, we apply some
> backpressure on the clients to throttle the memory allocations coming
> from i915.ko. Currently, we wait until completely idle, but this is far
> too heavy and leads to some situations where the only escape is to
> declare a client hung and reset the GPU. The intent is to only ratelimit
> the allocation requests and to allow ourselves to recycle requests and
> memory from any long queues built up by a client hog.
>
> Although system memory is inherently a global resource, we don't
> want to overly penalize an unlucky client with the price of reaping a
> hog. To reduce the influence of one client on another, we can, instead
> of waiting for the entire GPU to idle, impose a barrier on the local
> client.
> (One end goal for request allocation is for scalability to many
> concurrent allocators; simultaneous execbufs.)
>
> To prevent ourselves from getting caught out by long running requests
> (requests that may never finish without userspace intervention, whom we
> are blocking) we need to impose a finite timeout, ideally shorter than
> hangcheck. A long time ago Paul McKenney suggested that RCU users should
> ratelimit themselves using judicious use of cond_synchronize_rcu(). This
> gives us the opportunity to reduce our indefinite wait for the GPU to
> idle to a wait for the RCU grace period of the previous allocation along
> this timeline to expire, satisfying both the local and finite properties
> we desire for our ratelimiting.
>
> There are still a few global steps (reclaim not least amongst those!)
> when we exhaust the immediate slab pool, but at least now the wait is
> itself decoupled from struct_mutex for our glorious highly parallel future!
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris at chris-wilson.co.uk
>
> Pretty confident!
Unfortunately, this is still not fixed, as it is happening at the same rate as
before:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4948/shard-kbl2/igt@gem_exec_await@wide-contexts.html
Starting subtest: wide-contexts
(gem_exec_await:1079) igt_aux-CRITICAL: Test assertion failure function
sig_abort, file ../lib/igt_aux.c:500:
(gem_exec_await:1079) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest wide-contexts failed.
And we have also seen issues on WHL and CNL:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_124/fi-cnl-u/igt@gem_exec_await@wide-contexts.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_121/fi-whl-u/igt@gem_exec_await@wide-contexts.html
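For reference, the ratelimiting scheme described in the quoted commit can be sketched as follows. This is a hypothetical kernel-style sketch, not the actual i915 patch: the struct and function names (timeline, request_alloc, request_cache) are illustrative, and only get_state_synchronize_rcu()/cond_synchronize_rcu() are real kernel RCU APIs. Instead of waiting for the whole GPU to idle on allocation failure, each timeline remembers an RCU grace-period cookie from its previous allocation and waits only for that grace period to expire:

    /*
     * Hypothetical sketch of per-timeline backpressure via RCU cookies.
     * Names other than the RCU APIs are invented for illustration.
     */
    struct timeline {
    	unsigned long last_rcu_cookie;	/* from get_state_synchronize_rcu() */
    };

    static struct i915_request *request_alloc(struct timeline *tl)
    {
    	struct i915_request *rq;

    	rq = kmem_cache_alloc(request_cache, GFP_KERNEL | __GFP_NOWARN);
    	if (!rq) {
    		/*
    		 * Backpressure: wait only until the grace period that
    		 * covered the previous allocation on this timeline has
    		 * elapsed. The wait is local (per client/timeline) and
    		 * finite (one grace period), unlike idling the whole GPU.
    		 */
    		cond_synchronize_rcu(tl->last_rcu_cookie);
    		rq = kmem_cache_alloc(request_cache, GFP_KERNEL);
    		if (!rq)
    			return NULL;
    	}
    	tl->last_rcu_cookie = get_state_synchronize_rcu();
    	return rq;
    }

The point of the cookie is that cond_synchronize_rcu() returns immediately if the recorded grace period has already elapsed, so a well-behaved client pays nothing, while a hog that allocates faster than grace periods expire is throttled.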
--
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.