[Intel-gfx] [PATCH 24/46] drm/i915: Do a synchronous switch-to-kernel-context on idling

Thu Feb 21 21:42:12 UTC 2019

Quoting Daniele Ceraolo Spurio (2019-02-21 21:31:45)
> 
> 
> On 2/21/19 1:17 PM, Chris Wilson wrote:
> > Quoting Daniele Ceraolo Spurio (2019-02-21 19:48:01)
> >>
> >> <snip>
> >>
> >>> @@ -4481,19 +4471,7 @@ int i915_gem_suspend(struct drm_i915_private *i915)
> >>>         * state. Fortunately, the kernel_context is disposable and we do
> >>>         * not rely on its state.
> >>>         */
> >>> -     if (!i915_terminally_wedged(&i915->gpu_error)) {
> >>> -             ret = i915_gem_switch_to_kernel_context(i915);
> >>> -             if (ret)
> >>> -                     goto err_unlock;
> >>> -
> >>> -             ret = i915_gem_wait_for_idle(i915,
> >>> -                                          I915_WAIT_INTERRUPTIBLE |
> >>> -                                          I915_WAIT_LOCKED |
> >>> -                                          I915_WAIT_FOR_IDLE_BOOST,
> >>> -                                          HZ / 5);
> >>> -             if (ret == -EINTR)
> >>> -                     goto err_unlock;
> >>> -
> >>> +     if (!switch_to_kernel_context_sync(i915)) {
> >>>                /* Forcibly cancel outstanding work and leave the gpu quiet. */
> >>>                i915_gem_set_wedged(i915);
> >>>        }
> >>
> >> GuC-related question: what's your expectation here with regard to GuC
> >> status? The current i915 flow expects either uc_reset_prepare() or
> >> uc_suspend() to be called to clean up the guc status, but we're calling
> >> neither of them here if the switch is successful. Do you expect the
> >> resume code to always blank out the GuC status before a reload?
> > 
> > (A few patches later on I propose that we always just do a reset+wedge
> > on suspend in lieu of hangcheck.)
> > 
> > On resume, we have to bring the HW up from scratch and do another reset
> > in the process. Some platforms have been known to survive the trips to
> > PCI_D3 (someone is lying!) and so we _have_ to do a reset to be sure we
> > clear the HW state. I expect we would need to force a reset on resume
> > even for the guc, to be sure we cover all cases such as kexec.
> > -Chris
> > 
> More than about the HW state, my question here was about the SW 
> tracking. At which point do we go and stop guc communication and mark 
> guc as not loaded/accessible? e.g. we need to disable and re-enable CT 
> buffers before GuC is reset/suspended to make sure the shared memory 
> area is cleaned correctly (we currently avoid memsetting all of it on 
> reload since it is quite big). Also, communication with GuC is going to 
> increase going forward, so we'll need to make sure we accurately track 
> its state and do all the relevant cleanups.

Across suspend/resume, we issue a couple of resets and scrub/sanitize our
state tracking. By the time we load the fw again, both the fw and our
state should be starting from scratch.
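
As a rough illustration of that ordering, here is a toy model, not the
i915 code. Every name in it (toy_device, toy_fw_state, toy_hw_reset,
toy_sanitize, toy_suspend, toy_resume) is hypothetical and exists only
for this sketch. It assumes the software tracking is a small struct that
can simply be zeroed, and that a reset is always paired with a sanitize
on both sides of the suspend.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical software-side view of the firmware (not a real i915 struct). */
struct toy_fw_state {
	bool loaded;               /* driver believes the fw is running */
	bool ct_enabled;           /* driver believes the comms channel is open */
	unsigned int pending_msgs; /* messages the driver thinks are in flight */
};

struct toy_device {
	unsigned int hw_resets;    /* how many resets were issued so far */
	struct toy_fw_state fw;
};

/* Stand-in for a full HW reset; here it only counts the call. */
void toy_hw_reset(struct toy_device *dev)
{
	dev->hw_resets++;
}

/* Scrub the SW tracking so it matches freshly reset HW. */
void toy_sanitize(struct toy_device *dev)
{
	memset(&dev->fw, 0, sizeof(dev->fw));
}

/* Suspend: reset the HW, then drop whatever we believed about the fw. */
void toy_suspend(struct toy_device *dev)
{
	toy_hw_reset(dev);
	toy_sanitize(dev);
}

/* Resume: reset and sanitize again, then reload from a clean slate. */
void toy_resume(struct toy_device *dev)
{
	toy_hw_reset(dev);
	toy_sanitize(dev);

	/* Both the fw and our tracking must start from scratch here. */
	assert(!dev->fw.loaded && !dev->fw.ct_enabled);
	assert(dev->fw.pending_msgs == 0);

	dev->fw.loaded = true;     /* stand-in for reloading the fw */
	dev->fw.ct_enabled = true; /* stand-in for re-enabling comms */
}
```

The point of the sketch is only that the reload path never has to trust
state left over from before the suspend, since each reset is followed by
a scrub of the tracking.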

That all seems unavoidable, so I am not understanding the essence of
your question.
-Chris