[Bug 101891] [BAT][BDW] WARN_ON(!intel_engines_are_idle(dev_priv)) in i915_gem_suspend+0x123/0x140

Fri Jul 28 16:18:34 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=101891

--- Comment #1 from Chris Wilson <chris at chris-wilson.co.uk> ---
It's just one of those impossible conditions that should never fire. The
sequence is this:

        /* As the idle_work is rearming if it detects a race, play safe and
         * repeat the flush until it is definitely idle.
         */
        while (flush_delayed_work(&dev_priv->gt.idle_work))
                ;

        /* Assert that we successfully flushed all the work and
         * reset the GPU back to its idle, low power state.
         */
        WARN_ON(dev_priv->gt.awake);
        WARN_ON(!intel_engines_are_idle(dev_priv));

The idle work waits for the engines to idle and then sets gt.awake=false.
Before the engines can be woken again, gt.awake has to be set back to true. So
we either have a race despite being in a single-threaded suspend context,
or... I have no idea.
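
To make the expectation concrete, here is a rough user-space model of the flush loop. It is only a sketch: gt_model, idle_work_model(), flush_work_model() and flush_until_idle() are made-up names, not driver code. The one thing it borrows from the kernel is the return convention of flush_delayed_work(): true if there was work to wait for, false if the work was already idle.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bits of dev_priv->gt. */
struct gt_model {
        bool awake;           /* models gt.awake */
        bool work_pending;    /* the idle work is queued */
        int  busy_polls_left; /* how often the worker will still see busy engines */
};

/* Worker body: rearm while the engines look busy, otherwise park the GT. */
static void idle_work_model(struct gt_model *gt)
{
        if (gt->busy_polls_left > 0) {
                gt->busy_polls_left--;
                gt->work_pending = true; /* detected a race, try again later */
                return;
        }
        gt->awake = false;
}

/* Returns true if work was pending and has now run, false if already idle. */
static bool flush_work_model(struct gt_model *gt)
{
        if (!gt->work_pending)
                return false;
        gt->work_pending = false;
        idle_work_model(gt);
        return true;
}

/* Repeat the flush until nothing is left; returns the number of flushes. */
static int flush_until_idle(struct gt_model *gt)
{
        int flushes = 0;

        while (flush_work_model(gt))
                flushes++;
        return flushes;
}
```

In this model the loop can only exit with work_pending clear and awake false, which is why both WARN_ONs after the real loop are expected to be unreachable.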

bool intel_engines_are_idle(struct drm_i915_private *dev_priv)
{
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        if (READ_ONCE(dev_priv->gt.active_requests))
                return false;

        /* If the driver is wedged, HW state may be very inconsistent and
         * report that it is still busy, even though we have stopped using it.
         */
        if (i915_terminally_wedged(&dev_priv->gpu_error))
                return true;

        for_each_engine(engine, dev_priv, id) {
                if (!intel_engine_is_idle(engine))
                        return false;
        }

        return true;
}

bool intel_engine_is_idle(struct intel_engine_cs *engine)
{
        struct drm_i915_private *dev_priv = engine->i915;

        /* More white lies, if wedged, hw state is inconsistent */
        if (i915_terminally_wedged(&dev_priv->gpu_error))
                return true;

        /* Any inflight/incomplete requests? */
        if (!i915_seqno_passed(intel_engine_get_seqno(engine),
                               intel_engine_last_submit(engine)))
                return false;

        if (I915_SELFTEST_ONLY(engine->breadcrumbs.mock))
                return true;

        /* Interrupt/tasklet pending? */
        if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
                return false;

        /* Both ports drained, no more ELSP submission? */
        if (port_request(&engine->execlist_port[0]))
                return false;

        /* ELSP is empty, but there are ready requests? */
        if (READ_ONCE(engine->execlist_first))
                return false;

        /* Ring stopped? */
        if (!ring_is_idle(engine))
                return false;

        return true;
}
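
For reference, the inflight check above depends on a wraparound-safe seqno comparison. A minimal sketch, assuming it is the usual signed-difference idiom (seqno_passed() here is an illustrative stand-in, not a copy of the driver helper):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if seq1 is at or after seq2, treating the 32-bit counter as
 * circular. The unsigned subtraction wraps, and the cast to a signed
 * type turns "slightly behind" into a negative number. (The cast relies
 * on two's complement conversion, which common compilers provide.)
 */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
        return (int32_t)(seq1 - seq2) >= 0;
}
```

With that idiom a hardware seqno of 2 still counts as having passed a last-submit value of 0xfffffffe, so a counter wrap on its own should not make an engine look busy.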

It might be possible for an interrupt to kick in and dirty irq_posted, a very
late active->idle notification. Or the ring_is_idle() check on RING_MODE may be
garbage.

I'm going to go back and play the waiting game. Note for future self, consider
adding a WARN_ON(test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted));
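
A sketch of what that extra diagnostic could look like, written as a user-space model rather than driver code. The struct, enum and function names are all hypothetical. The idea is to report which idle check failed instead of a bare false, so the next occurrence says whether it was a late interrupt or the ring state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the engine state; not the i915 structs. */
struct engine_model {
        bool requests_inflight; /* seqno has not passed the last submit */
        bool irq_posted;        /* execlist interrupt bit still set */
        bool port0_busy;        /* a request still sits in the first port */
        bool queue_ready;       /* ready requests not yet submitted */
        bool ring_idle;         /* hardware reports the ring as stopped */
};

enum idle_fail {
        IDLE_OK = 0,
        IDLE_INFLIGHT,
        IDLE_IRQ_POSTED,
        IDLE_PORT_BUSY,
        IDLE_QUEUE_READY,
        IDLE_RING_BUSY,
};

/* Same order of checks as the function quoted above, minus wedged/selftest. */
static enum idle_fail engine_idle_reason(const struct engine_model *e)
{
        if (e->requests_inflight)
                return IDLE_INFLIGHT;
        if (e->irq_posted)
                return IDLE_IRQ_POSTED;
        if (e->port0_busy)
                return IDLE_PORT_BUSY;
        if (e->queue_ready)
                return IDLE_QUEUE_READY;
        if (!e->ring_idle)
                return IDLE_RING_BUSY;
        return IDLE_OK;
}

/* Short name for logging alongside the warning. */
static const char *idle_fail_name(enum idle_fail reason)
{
        switch (reason) {
        case IDLE_OK:          return "idle";
        case IDLE_INFLIGHT:    return "inflight";
        case IDLE_IRQ_POSTED:  return "irq_posted";
        case IDLE_PORT_BUSY:   return "port_busy";
        case IDLE_QUEUE_READY: return "queue_ready";
        case IDLE_RING_BUSY:   return "ring_busy";
        }
        return "unknown";
}
```

In the driver, the equivalent would be one WARN_ON per condition at the suspend site, so the backtrace itself identifies the culprit.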
