[Intel-gfx] [PATCH 04/43] drm/i915: Do a synchronous switch-to-kernel-context on idling
Chris Wilson
chris at chris-wilson.co.uk
Thu Mar 7 22:24:34 UTC 2019
Quoting Tvrtko Ursulin (2019-03-07 17:06:58)
>
> On 07/03/2019 13:29, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-03-07 13:07:18)
> >>
> >> On 06/03/2019 14:24, Chris Wilson wrote:
> >>> +static bool switch_to_kernel_context_sync(struct drm_i915_private *i915)
> >>> +{
> >>> +	if (i915_gem_switch_to_kernel_context(i915))
> >>> +		return false;
> >>
> >> Is it worth still trying to idle if this fails? Since the timeout is
> >> short, maybe a reset in the idle state brings less havoc than not
> >> idling at all. It can only fail on memory allocations I think, or when
> >> terminally wedged, in which case it is still okay.
> >
> > Terminally wedged is hardwired to return 0 (this, or the next patch?) so
> > that we don't bail during i915_gem_suspend() for this reason.
> >
> > We do still idle if this fails, as we mark the driver/GPU as wedged:
> > we perform a GPU reset so that it hopefully isn't shooting kittens any
> > more, and pull a pillow over our heads.
>
> I didn't find a path which idles before wedging if
> switch_to_kernel_context_sync fails due to a failing
> i915_gem_switch_to_kernel_context. Where is it?
Wedging implies idling. When we are wedged, the GPU is reset and left
pointing into the void (effectively idle, GT powersaving should be
unaffected by the wedge, don't ask how that works on ilk). All the
callers do
	if (!switch_to_kernel_context_sync())
		i915_gem_set_wedged();
with more or less intermediate steps. Hmm, given that is true, why not
pull it into switch_to_kernel_context_sync()...
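Roughly, as a sketch only (not part of this patch, just to illustrate
folding the wedge-on-failure into the helper so the callers don't have to
repeat it):

static bool switch_to_kernel_context_sync(struct drm_i915_private *i915)
{
	bool result = true;

	/* Ask every engine to switch back to the pinned kernel context. */
	if (i915_gem_switch_to_kernel_context(i915))
		result = false;

	/* Give the GPU a short while to settle before pulling the plug. */
	if (result &&
	    i915_gem_wait_for_idle(i915,
				   I915_WAIT_LOCKED |
				   I915_WAIT_FOR_IDLE_BOOST,
				   HZ / 10))
		result = false;

	if (!result) {
		/* Forcibly cancel outstanding work and leave the GPU quiet. */
		dev_err(i915->drm.dev,
			"Failed to idle engines, declaring wedged!\n");
		GEM_TRACE_DUMP();
		i915_gem_set_wedged(i915);
	}

	i915_retire_requests(i915);
	return result;
}

Then the callers reduce to a bare switch_to_kernel_context_sync() call (or
whatever we end up calling it).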
> It is a minor concern, don't get me wrong. It is unlikely to fail like
> this. I was simply thinking why not try and wait for the current work to
> finish before suspending in this case. Might be a better experience
> after resume.
For the desktop use case, it's immaterial as the hotplug, reconfigure
and redraw take care of that. (Fbcon is also cleared.)
It's the server use case with persistent jobs that worries me; I expect
they will be upset unless we can do a preempt-to-idle beforehand. Are
suspend cycles even a factor? Probably not, but you never know with
thermald and host migration what they may get up to.
> > That's pretty much our answer in every case, wedge and hope the users go
> > away.
> >
> >>> +
> >>> +	if (i915_gem_wait_for_idle(i915,
> >>> +				   I915_WAIT_LOCKED |
> >>> +				   I915_WAIT_FOR_IDLE_BOOST,
> >>> +				   HZ / 10))
> >>
> >> You'll presumably change HZ / 10 to that new macro.
> >
> > Yup.
> >
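For reference, something along these lines; the macro name below is just a
placeholder for whatever we end up calling it:

/* A single named bound on how long we let the GPU settle before wedging. */
#define I915_IDLE_ENGINES_TIMEOUT (HZ / 10) /* placeholder name and value */

	if (i915_gem_wait_for_idle(i915,
				   I915_WAIT_LOCKED |
				   I915_WAIT_FOR_IDLE_BOOST,
				   I915_IDLE_ENGINES_TIMEOUT))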
> >>>  static void
> >>>  i915_gem_idle_work_handler(struct work_struct *work)
> >>>  {
> >>> -	struct drm_i915_private *dev_priv =
> >>> -		container_of(work, typeof(*dev_priv), gt.idle_work.work);
> >>> +	struct drm_i915_private *i915 =
> >>> +		container_of(work, typeof(*i915), gt.idle_work.work);
> >>> +	typeof(i915->gt) *gt = &i915->gt;
> >>
> >> I am really not sure about the typeof idiom in normal C code. :( It
> >> saves a little bit of typing, and a little bit of churn if the type
> >> name changes, but it just feels weird to use it in some places and not
> >> others.
> >
> > But then we have to name the thing! We're not sold on gt; it means quite
> > a few different things around the bspec. This bit is actually the part
> > that I'm earmarking for i915_gem itself (high level idle/power/user
> > management), and I'm contemplating i915_gpu for the bits that sit beneath
> > us but are still a management layer over hardware (and intel_foo for the
> > bits that talk to hardware). Maybe that too will change if we completely
> > split out into different modules.
>
> So you could have left it as is for now and had a smaller diff. But
> okay, have it if you insist.
No, you've stated on a few occasions that you don't like gt->X, so I'll
have to find a new strategy and fix up patches as I remember your
distaste.
The general plan is still to try and split drm_i915_private into i915_gem
and i915_gpu (for us at least).
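For what it's worth, the attraction of typeof here is that gt is an
anonymous struct member of drm_i915_private, so there is no struct tag we
could spell out even if we wanted to. A toy example, nothing to do with the
real i915 layout:

struct toy {
	struct {
		unsigned int active_requests;
	} gt; /* anonymous: no struct tag to refer to */
};

static void toy_idle(struct toy *i915)
{
	/* typeof() lets us take a typed pointer without inventing a tag. */
	typeof(i915->gt) *gt = &i915->gt;

	++gt->active_requests;
}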
> >>>  	/*
> >>> -	 * New request retired after this work handler started, extend active
> >>> -	 * period until next instance of the work.
> >>> +	 * Flush out the last user context, leaving only the pinned
> >>> +	 * kernel context resident. Should anything unfortunate happen
> >>> +	 * while we are idle (such as the GPU being power cycled), no users
> >>> +	 * will be harmed.
> >>>  	 */
> >>> -	if (new_requests_since_last_retire(dev_priv))
> >>> -		goto out_unlock;
> >>> -
> >>> -	__i915_gem_park(dev_priv);
> >>> +	if (!gt->active_requests && !work_pending(&gt->idle_work.work)) {
> >>
> >> Could have left the new_requests_since_last_retire for readability.
> >
> > As always, it changed several times. I actually think that, considering
> > we play games with active_requests, keeping that in the open is
> > important. ~o~
>
> Ok.
>
> >>
> >>> +		++gt->active_requests; /* don't requeue idle */
> >>> +
> >>> +		if (!switch_to_kernel_context_sync(i915)) {
> >>> +			dev_err(i915->drm.dev,
> >>> +				"Failed to idle engines, declaring wedged!\n");
> >>> +			GEM_TRACE_DUMP();
> >>> +			i915_gem_set_wedged(i915);
> >>> +		}
> >>> +		i915_retire_requests(i915);
> >
> >>> @@ -3128,7 +3120,6 @@ int i915_gem_wait_for_idle(struct drm_i915_private *i915,
> >>>  				return err;
> >>>
> >>>  		i915_retire_requests(i915);
> >>> -		GEM_BUG_ON(i915->gt.active_requests);
> >>
> >> Belongs to this patch?
> >
> > Yes. See the above chunk.
>
> The idle work handler? More hand-holding required, I'm afraid.
Ok, in the idle work handler we bump active_requests before emitting the
switch-to-kernel-context request, so that the new request does not requeue
the idle work handler. That artificial bump is still held when
i915_gem_wait_for_idle() retires the requests, so we would intentionally
trip the GEM_BUG_ON(i915->gt.active_requests) there; hence its removal in
this patch.
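To spell it out against the removed assertion (comments mine):

	/* in i915_gem_wait_for_idle(), called from the idle work handler */
	i915_retire_requests(i915);
	GEM_BUG_ON(i915->gt.active_requests); /* still 1 from the bump above,
						* so this would fire; hence
						* its removal */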
-Chris