[Intel-gfx] [PATCH 01/11] drm/i915/gt: Close race between engine_park and intel_gt_retire_requests

Sat Nov 16 18:21:19 UTC 2019

Quoting Chris Wilson (2019-11-16 17:54:38)
> Quoting Chris Wilson (2019-11-16 17:51:29)
> > The general concept was that intel_timeline.active_count was locked by
> > the intel_timeline.mutex. The exception was for power management, where
> > the engine->kernel_context->timeline could be manipulated under the
> > global wakeref.mutex.
> > 
> > This was quite solid, as we always manipulated the timeline only while
> > we held an engine wakeref.
> > 
> > And then we started retiring requests outside of struct_mutex, only
> > using the timelines.active_list and the timeline->mutex. There we
> > started manipulating intel_timeline.active_count outside of an engine
> > wakeref, and so introduced a race between __engine_park() and
> > intel_gt_retire_requests(), a race that could result in the
> > engine->kernel_context not being added to the active timelines and so
> > losing requests, which caused us to keep the system permanently powered
> > up [and unloadable].
> > 
> > The race would be easy to close if we could take the engine wakeref for
> > the timeline before we retire -- except timelines are not bound to any
> > engine and so we would need to keep all active engines awake. The
> > alternative is to guard intel_timeline_enter/intel_timeline_exit for use
> > outside of the timeline->mutex.
> 
> So the idea of taking a wakeref on all active engines during retire is
> growing on me. That's the less frequent path and so avoids adding
> another set of atomic manipulations to every request.

On second thought, no so easy as the set of active engines is
insufficient as new engines may become active as we wait. And the
thought of just taking all engines is a nuisance.
-Chris