[Intel-gfx] [PATCH 2/9] drm/i915/gt: Close race between engine_park and intel_gt_retire_requests
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Nov 20 13:19:30 UTC 2019
On 20/11/2019 09:32, Chris Wilson wrote:
> The general concept was that intel_timeline.active_count was protected
> by the intel_timeline.mutex. The exception was for power management,
> where the engine->kernel_context->timeline could be manipulated under
> the global wakeref.mutex.
>
> This was quite solid, as we only ever manipulated the timeline while
> we held an engine wakeref.
>
> And then we started retiring requests outside of struct_mutex, only
> using the timelines.active_list and the timeline->mutex. There we
> started manipulating intel_timeline.active_count outside of an engine
> wakeref, and so introduced a race between __engine_park() and
> intel_gt_retire_requests(). That race could result in the
> engine->kernel_context not being added to the active timelines, and
> so in lost requests, which kept the system permanently powered up
> [and unloadable].
>
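To make the "manipulating active_count outside of an engine wakeref" point
concrete: once two paths can touch the counter without sharing a lock, a
plain unsigned int ++/-- can lose updates, and with them the 0 <-> 1
transition that guards the list_add/list_del. A tiny standalone C11 demo of
the generic hazard follows (not i915 code, names are illustrative; build
with cc -pthread):

/* race_demo.c: lost updates on a plain counter vs an atomic one */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static int plain_count;         /* racy: ++ is a load/modify/store (UB, for demo only) */
static atomic_int atomic_count; /* safe: single atomic read-modify-write */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		plain_count++;                   /* increments can be lost */
		atomic_fetch_add(&atomic_count, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* plain_count usually ends up short of 2 * ITERS; atomic is exact */
	printf("plain=%d atomic=%d expected=%d\n",
	       plain_count, atomic_load(&atomic_count), 2 * ITERS);
	return 0;
}
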
> The race would be easy to close if we could take the engine wakeref for
> the timeline before we retire -- except timelines are not bound to any
> engine and so we would need to keep all active engines awake. The
> alternative is to guard intel_timeline_enter/intel_timeline_exit for use
> outside of the timeline->mutex.
>
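For anyone less familiar with the idiom the patch switches to, it is the
usual "inc-not-zero fast path, fall back to the list lock and let the
0 <-> 1 transition pick the thread that touches the list" pattern. A rough
userspace C11 sketch of that pattern follows; everything in it is an
illustrative stand-in (pthread mutex plus stdatomic instead of the kernel
spinlock/atomic_t API, and struct timeline / timeline_enter / timeline_exit
are hypothetical names, not the i915 functions):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct timeline {
	atomic_int active_count;
	struct timeline *next;          /* toy singly-linked active list */
};

static pthread_mutex_t active_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timeline *active_list;

/* Equivalent of atomic_add_unless(v, 1, 0): increment iff non-zero. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0)
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;

	return false;
}

/* Equivalent of atomic_add_unless(v, -1, 1): decrement iff greater than 1. */
static bool dec_not_one(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 1)
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			return true;

	return false;
}

static void timeline_enter(struct timeline *tl)
{
	if (inc_not_zero(&tl->active_count))
		return;                 /* already on the active list */

	pthread_mutex_lock(&active_lock);
	/* Only the thread performing the 0 -> 1 transition inserts. */
	if (atomic_fetch_add(&tl->active_count, 1) == 0) {
		tl->next = active_list;
		active_list = tl;
	}
	pthread_mutex_unlock(&active_lock);
}

static void timeline_exit(struct timeline *tl)
{
	if (dec_not_one(&tl->active_count))
		return;                 /* others still keep it active */

	pthread_mutex_lock(&active_lock);
	/* Only the thread performing the 1 -> 0 transition removes. */
	if (atomic_fetch_sub(&tl->active_count, 1) == 1) {
		struct timeline **p = &active_list;

		while (*p && *p != tl)
			p = &(*p)->next;
		if (*p)
			*p = tl->next;
	}
	pthread_mutex_unlock(&active_lock);
}

int main(void)
{
	static struct timeline tl;      /* zero-initialised */

	timeline_enter(&tl);    /* 0 -> 1: added to active_list */
	timeline_enter(&tl);    /* fast path, no lock taken */
	timeline_exit(&tl);     /* 2 -> 1: stays on the list */
	timeline_exit(&tl);     /* 1 -> 0: removed from the list */
	return 0;
}

The point being that whichever thread observes the transition under the
lock is the only one that modifies the list, so a concurrent retirer or
parker can no longer miss the list_add.
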
> Fixes: e5dadff4b093 ("drm/i915: Protect request retirement with timeline->mutex")
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 8 ++---
> drivers/gpu/drm/i915/gt/intel_timeline.c | 34 +++++++++++++++----
> .../gpu/drm/i915/gt/intel_timeline_types.h | 2 +-
> 3 files changed, 32 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> index 25291e2af21e..1a005da8c588 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> @@ -49,8 +49,8 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
> continue;
>
> intel_timeline_get(tl);
> - GEM_BUG_ON(!tl->active_count);
> - tl->active_count++; /* pin the list element */
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + atomic_inc(&tl->active_count); /* pin the list element */
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> if (timeout > 0) {
> @@ -71,14 +71,14 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
>
> /* Resume iteration after dropping lock */
> list_safe_reset_next(tl, tn, link);
> - if (!--tl->active_count)
> + if (atomic_dec_and_test(&tl->active_count))
> list_del(&tl->link);
>
> mutex_unlock(&tl->mutex);
>
> /* Defer the final release to after the spinlock */
> if (refcount_dec_and_test(&tl->kref.refcount)) {
> - GEM_BUG_ON(tl->active_count);
> + GEM_BUG_ON(atomic_read(&tl->active_count));
> list_add(&tl->link, &free);
> }
> }
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 0e277835aad0..b35f12729983 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -334,15 +334,33 @@ void intel_timeline_enter(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /*
> + * Pretend we are serialised by the timeline->mutex.
> + *
> + * While generally true, there are a few exceptions to the rule
> + * for the engine->kernel_context being used to manage power
> + * transitions. As the engine_park may be called from under any
> + * timeline, it uses the power mutex as a global serialisation
> + * lock to prevent any other request entering its timeline.
> + *
> + * The rule is generally tl->mutex, otherwise engine->wakeref.mutex.
> + *
> + * However, intel_gt_retire_requests() does not know which engine
> + * it is retiring along and so cannot partake in the engine-pm
> + * barrier, and there we use the tl->active_count as a means to
> + * pin the timeline in the active_list while the locks are dropped.
> + * Ergo, as that is outside of the engine-pm barrier, we need to
> + * use atomics to manipulate tl->active_count.
> + */
> lockdep_assert_held(&tl->mutex);
> -
> GEM_BUG_ON(!atomic_read(&tl->pin_count));
> - if (tl->active_count++)
> +
> + if (atomic_add_unless(&tl->active_count, 1, 0))
> return;
> - GEM_BUG_ON(!tl->active_count); /* overflow? */
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_add_tail(&tl->link, &timelines->active_list);
> + if (!atomic_fetch_inc(&tl->active_count))
> + list_add_tail(&tl->link, &timelines->active_list);
> spin_unlock_irqrestore(&timelines->lock, flags);
> }
>
> @@ -351,14 +369,16 @@ void intel_timeline_exit(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /* See intel_timeline_enter() */
> lockdep_assert_held(&tl->mutex);
>
> - GEM_BUG_ON(!tl->active_count);
> - if (--tl->active_count)
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + if (atomic_add_unless(&tl->active_count, -1, 1))
> return;
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_del(&tl->link);
> + if (atomic_dec_and_test(&tl->active_count))
> + list_del(&tl->link);
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> /*
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline_types.h b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> index 98d9ee166379..5244615ed1cb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> @@ -42,7 +42,7 @@ struct intel_timeline {
> * from the intel_context caller plus internal atomicity.
> */
> atomic_t pin_count;
> - unsigned int active_count;
> + atomic_t active_count;
>
> const u32 *hwsp_seqno;
> struct i915_vma *hwsp_ggtt;
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Regards,
Tvrtko