[Bug 106680] [CI][SHARDS] igt at gem_exec_await@wide-contexts - fail/dmesg-fail - Failed assertion: !"GPU hung"

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Fri Sep 14 11:04:43 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=106680

Chris Wilson <chris at chris-wilson.co.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Chris Wilson <chris at chris-wilson.co.uk> ---
commit 11abf0c5a021af683b8fe12b0d30fb1226d60e0f
Author: Chris Wilson <chris at chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:15 2018 +0100

    drm/i915: Limit the backpressure for i915_request allocation

    If we try and fail to allocate an i915_request, we apply some
    backpressure on the clients to throttle the memory allocations coming
    from i915.ko. Currently, we wait until completely idle, but this is far
    too heavy and leads to some situations where the only escape is to
    declare a client hung and reset the GPU. The intent is to only ratelimit
    the allocation requests and to allow ourselves to recycle requests and
    memory from any long queues built up by a client hog.

    Although system memory is inherently a global resource, we don't want
    to overly penalize an unlucky client by making it pay the price of
    reaping a hog. To reduce the influence of one client on another, we can,
    instead of waiting for the entire GPU to idle, impose a barrier on the
    local client.
    (One end goal for request allocation is for scalability to many
    concurrent allocators; simultaneous execbufs.)

    To prevent ourselves from getting caught out by long-running requests
    (requests that may never finish without intervention from the very
    userspace we are blocking), we need to impose a finite timeout, ideally
    shorter than hangcheck. A long time ago Paul McKenney suggested that RCU
    users should ratelimit themselves by judicious use of
    cond_synchronize_rcu(). This
    gives us the opportunity to reduce our indefinite wait for the GPU to
    idle to a wait for the RCU grace period of the previous allocation along
    this timeline to expire, satisfying both the local and finite properties
    we desire for our ratelimiting.

    There are still a few global steps (reclaim not least amongst them!)
    when we exhaust the immediate slab pool, but at least now the wait is
    itself decoupled from struct_mutex for our glorious, highly parallel
    future!

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
    Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
    Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
    Link:
https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk

Pretty confident!
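
For anyone reading along, the throttle works roughly like this (a minimal
sketch of the pattern the commit message describes, not the actual i915
code; my_request, my_request_alloc and the prev argument are hypothetical
stand-ins for the i915 request struct, its slab cache and the previous
request on the client's timeline):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_request {
	unsigned long rcustate;	/* RCU grace-period cookie taken at allocation */
	/* ... fence, timeline links, etc. ... */
};

static struct my_request *
my_request_alloc(struct kmem_cache *cache, struct my_request *prev)
{
	struct my_request *rq;

	/* Cheap attempt first: no heavy reclaim, no allocation warning. */
	rq = kmem_cache_alloc(cache,
			      GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
	if (unlikely(!rq)) {
		/*
		 * Ratelimit ourselves: wait (at most one RCU grace period)
		 * for the grace period recorded by the previous request on
		 * this timeline to expire, rather than for the whole GPU to
		 * idle, then retry with the full GFP_KERNEL effort.
		 */
		if (prev)
			cond_synchronize_rcu(prev->rcustate);

		rq = kmem_cache_alloc(cache, GFP_KERNEL);
		if (!rq)
			return NULL;
	}

	/* Record the current grace-period state for the next allocator. */
	rq->rcustate = get_state_synchronize_rcu();
	return rq;
}

Note that cond_synchronize_rcu() returns immediately if a full grace period
has already elapsed since the recorded cookie, so a well-behaved client
normally pays nothing; only an allocator racing ahead of RCU gets throttled.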

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.