[Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

Tue May 6 21:27:37 CEST 2014

On Tue, May 06, 2014 at 05:46:01PM +0300, Imre Deak wrote:
> On Tue, 2014-05-06 at 12:59 +0100, Chris Wilson wrote:
> > On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote:
> > > On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
> > > > On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
> > > > > Currently user space can access GEM buffers mapped to GTT through
> > > > > existing mappings concurrently while the platform specific suspend
> > > > > handlers are running.  Since these handlers may change the HW state in a
> > > > > way that would break such accesses, remove the mappings before calling
> > > > > the handlers.
> > > > 
> > > > Hmm, but you never locked the device, so what is preventing those
> > > > concurrent accesses from refaulting in GTT entires anyway. Please explain
> > > > the context under which the runtime suspend code executes, and leave
> > > > that explanation within easy reach of intel_runtime_suspend() -
> > > > preferrably with testing of those assumptions.
> > > 
> > > During faulting we take an RPM reference, so that avoids concurrent
> > > re-faults. I could add a comment about this to the code.
> > 
> > You are still iterating over lists that are not safe, right? Or are
> > there more magic rpm references that prevent ioctls whilst
> > intel_runtime_suspend is being called?
> 
> Tbh I haven't checked this, since moving the unmapping earlier doesn't
> make a difference in this regard.
> 
> But it's a good point, I tried to audit now those paths. Currently the
> assumption is that we hold an RPM reference everywhere where those lists
> are changed. On the exec buffer path this is true, but for example in
> i915_drop_caches_set() it's not.
> 
> We could fix this by taking struct_mutex around
> i915_gem_release_all_mmaps() in intel_runtime_suspend(). Doing that
> needs some more auditing to make sure we can't deadlock somehow. At
> first glance it seems that the driver always schedules a deferred work
> and calls intel_runtime_suspend() from that, so I think it's fine.
> 
> I suggest applying this patch regardless since the two issues are
> separate.

If I understand the situation correctly the runtime suspend function only
ever gets called from a worker thread after the hysteris timeout expired.
Which means we should be able to wrap _just_ the gtt pte shotdown with
dev->struct_mutex and nothing else. Which is good since profileration of
dev->struct_mutex is awful.

On the resume side we don't need any locking since the gtt fault handler
will first grab the runtime reference and also dev->struct_mutex.

One issue which is looming is that this might deadlock. We might need a
trylock in the runtime suspend function and abort the runtime suspend if
we can't get the lock. Please test that lockdep catches this before we
commit to a design.

Just a very quick analysis, I didn't check the details so this might be
horribly wrong.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch