[Intel-gfx] [PATCH] drm/i915: Migrate stolen objects before hibernation

Tue Jun 30 04:54:14 PDT 2015

On Tue, Jun 30, 2015 at 12:32:38PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 01:22:59PM +0200, Daniel Vetter wrote:
> > On Tue, Jun 30, 2015 at 12:03:44PM +0100, Chris Wilson wrote:
> > > On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > > > +	memset(&node, 0, sizeof(node));
> > > > > +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> > > > > +						  &node,
> > > > > +						  4096, 0, I915_CACHE_NONE,
> > > > > +						  0, i915->gtt.mappable_end,
> > > > > +						  DRM_MM_SEARCH_DEFAULT,
> > > > > +						  DRM_MM_CREATE_DEFAULT);
> > > > 
> > > > Hm, I think the plan with stolen is to mostly use it for giant scanout
> > > > buffers where we never plan to access them with the gpu. Maybe go with a
> > > > per-page loop here instead? You have a low-level pte writing call below
> > > > anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
> > > > won't hurt.
> > > 
> > > I'm not understanding. This is a per-page loop (because we don't need to
> > > bind the entire stolen vma into GGTT for copying with the CPU and
> > > thereby increase the risk of failure). Speaking of failure, should
> > > hibernation be interruptible? I guess it is usually called from an
> > > interruptible process context.
> > 
> > I was blind and confused by the insert_entries we have in upstream, which
> > takes a sg_table and hence can only map the full view without some jumping
> > through hoops. Concern fully addressed already ;-)
> > 
> > Wrt uninterruptible: GPU should be idle already completely (and reset if
> > something went wrong) so no need for interruptible.
> 
> Note that I put the migration loop before the suspend, i.e. before the
> gpu_idle. Partly because, I felt the migration has the biggest chance of
> failure so should go first, and the gpu idle in suspend is quite
> convenient if we do use the GPU for blitting, but mainly because
> after i915_gem_suspend() doing GEM operations feels very wrong (there is
> a strong possibilty that we kick off some work queue or other that must
> be idle).

Hm, conceptually I think this should be part of i915_gem_suspend, but
means we'd get to wire a bool hibernate around a lot. And I also think we
should do this after gpu idle and forget about optimizations like using
the blitter - too much of a slowpath really to matter.

> > Hm, thinking about
> > this: Do we handle a gpu death only detected in gpu_idle? Nasty igt:
> > - inject hang, but be very careful to not cause any wait at all
> > - suspend
> > 
> > BOOM or not?
> 
> In my kernels, no boom. GPU hang waiting for idle is business as usual!
> In upstream, we have seen suspend/hibernate fail due to an untimely hang
> (iirc, usually worked on the second attempt so the bug report in
> question was about something else entirely except the logs contained the
> hibernate failure).

Ok, reality still matches with my expectations then. I'll wish for an igt
and your fix extracted ;-)

Without looking, do you just lock-drop and then retry (and block in the
interruptible acquisition of dev->struct_mutex)?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch