[Intel-gfx] [PATCH] drm/i915: Clear pending reset requests during suspend

Tue Jan 19 09:18:07 PST 2016

On Tue, Jan 19, 2016 at 05:01:00PM +0000, Arun Siluvery wrote:
> On 19/01/2016 16:42, Daniel Vetter wrote:
> >On Tue, Jan 19, 2016 at 03:04:09PM +0000, Arun Siluvery wrote:
> >>On 19/01/2016 14:13, Chris Wilson wrote:
> >>>On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
> >>>>On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
> >>>>>On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> >>>>>>On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> >>>>>>>Pending reset requests are cleared before suspending, they should be picked up
> >>>>>>>after resume when new work is submitted.
> >>>>>>>
> >>>>>>>This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> >>>>>>>are under review, as suggested by Chris this is extracted as a separate patch
> >>>>>>>as it can be useful now.
> >>>>>>>
> >>>>>>>Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> >>>>>>>Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> >>>>>>
> >>>>>>Pulling in the discussion we had from irc: Imo the right approach is to
> >>>>>>simply wait for gpu reset to finish it's job. Since that could in turn
> >>>>>>lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> >>>>>>that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> >>>>>>E.g.
> >>>>>>
> >>>>>>while (busy) {
> >>>>>>	mutex_lock();
> >>>>>>	gpu_idle();
> >>>>>>	mutex_unlock();
> >>>>>>
> >>>>>>	flush_work(reset_work);
> >>>>>>}
> >>>>>
> >>>>>Where does the requirement for gpu_idle come from? If there is a global
> >>>>>reset in progress, it cannot queue a request to flush the work and
> >>>>>waiting on the old results will be skipped. So just wait for the global
> >>>>>reset to complete, i.e. flush_work().
> >>>>
> >>>>Yes, but the global reset might in turn leave a wrecked gpu behind, or at
> >>>>least a non-idle one. Hence another gpu_idle on top, to make sure. If we
> >>>>change init_hw() of engines to be synchronous then we should have at least
> >>>>a WARN_ON(not_idle_but_i_expected_so()); in there ...
> >>
> >>gpu_error.work is removed in b8d24a06568368076ebd5a858a011699a97bfa42, we
> >
> >git sha1 from your private tree are meaningless in the public. Either link
> >to some git weburl or mailing lists archive link.
> 
> It is from drm-intel repo,
> http://cgit.freedesktop.org/drm-intel/commit/?id=b8d24a06568368076ebd5a858a011699a97bfa42
> 
> http://lists.freedesktop.org/archives/intel-gfx/2015-January/059154.html

Oh right, forgot that this landed, sorry for the confusion.

Summary of our irc discussion: We idle the gpu and flush the hangcheck
(which should flush the reset work) so at least with current upstream
there shouldn't be a bug. If there is a bug we need to understand it, we
can't just add code without clear explanation and reasons: At best that
confuses, at worst it hides some real bugs.
-Daniel

> 
> regards
> Arun
> 
> >
> >Thanks, Daniel
> >
> >>are doing reset in hangcheck work itself so I think there is no need to
> >>flush work.
> >>
> >>while (i915_reset_in_progress(gpu_error) &&
> >>        !i915_terminally_wedged(gpu_error)) {
> >>         int ret;
> >>
> >>         mutex_lock(&dev->struct_mutex);
> >>         ret = i915_gpu_idle(dev);
> >>         if (ret)
> >>                 DRM_ERROR("GPU is in inconsistent state after reset\n");
> >>         mutex_unlock(&dev->struct_mutex);
> >>}
> >>
> >>If the reset is successful we are idle before suspend otherwise in a wedged
> >>state. is this ok?
> >>
> >>regards
> >>Arun
> >>
> >>>
> >>>Does it matter on suspend? We test on resume if the GPU is usable, but
> >>>if we wanted to test on suspend then we should do
> >>>
> >>>flush_work();
> >>>if (i915_terminally_wedged())
> >>>    /* oh noes */;
> >>>-Chris
> >>>
> >>
> >
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch