[Intel-gfx] [PATCH 04/19] drm/i915/gt: Unlock engine-pm after queuing the kernel context switch

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Nov 19 14:35:04 UTC 2019


On 18/11/2019 23:02, Chris Wilson wrote:
> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
> the backend"), I erroneously concluded that we last modify the engine
> inside __i915_request_commit(), meaning that we could enable concurrent
> submission for userspace as we enqueued this request. However, this
> falls into a trap with other users of the engine->kernel_context waking
> up and submitting their request before the idle-switch is queued, with
> the result that the kernel_context is executed out-of-sequence most
> likely upsetting the GPU and certainly ourselves when we try to retire
> the out-of-sequence requests.
> 
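If I follow, the claimed race with the original ordering would be 
something like this sketch (the CPU0 calls are the ones from 
intel_engine_pm.c, the CPU1 side is my illustration of the "other 
users"):

	/* CPU0: parking the engine, original ordering */
	rq = __i915_request_create(ce, GFP_NOWAIT);
	__i915_request_commit(rq);
	__intel_wakeref_defer_park(&engine->wakeref);

	/* CPU1: another engine->kernel_context user, no longer excluded */
	rq2 = i915_request_create(engine->kernel_context);
	i915_request_add(rq2);	/* queued ahead of the idle-switch */

	/* CPU0: only now queues the idle-switch, hence out-of-sequence */
	__i915_request_queue(rq, NULL);
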
> As such we need to hold onto the effective engine->kernel_context mutex
> lock (via the engine pm mutex proxy) until we have finished queuing the
> request to the engine.
> 
> v2: Serialise against concurrent intel_gt_retire_requests()
> 
> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 15 +++++++++------
>   1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index 3c0f490ff2c7..722d3eec226e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -75,6 +75,7 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
>   
>   static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   {
> +	struct intel_context *ce = engine->kernel_context;
>   	struct i915_request *rq;
>   	unsigned long flags;
>   	bool result = true;
> @@ -99,15 +100,13 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	 * retiring the last request, thus all rings should be empty and
>   	 * all timelines idle.
>   	 */
> -	flags = __timeline_mark_lock(engine->kernel_context);
> +	flags = __timeline_mark_lock(ce);
>   
> -	rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
> +	rq = __i915_request_create(ce, GFP_NOWAIT);
>   	if (IS_ERR(rq))
>   		/* Context switch failed, hope for the best! Maybe reset? */
>   		goto out_unlock;
>   
> -	intel_timeline_enter(i915_request_timeline(rq));
> -
>   	/* Check again on the next retirement. */
>   	engine->wakeref_serial = engine->serial + 1;
>   	i915_request_add_active_barriers(rq);
> @@ -116,13 +115,17 @@ static bool switch_to_kernel_context(struct intel_engine_cs *engine)
>   	rq->sched.attr.priority = I915_PRIORITY_BARRIER;
>   	__i915_request_commit(rq);
>   
> +	__i915_request_queue(rq, NULL);
> +
>   	/* Release our exclusive hold on the engine */
>   	__intel_wakeref_defer_park(&engine->wakeref);
> -	__i915_request_queue(rq, NULL);
> +
> +	/* And finally expose our selfselves to intel_gt_retire_requests() */

ourselves

> +	intel_timeline_enter(ce->timeline);

I haven't really managed to follow this.

What are these other clients which can queue requests, and what in this 
block prevents them from doing so?

The change seems to move the queuing to before 
__intel_wakeref_defer_park() and intel_timeline_enter() to last.
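
If I am reading the diff right, the resulting order is roughly this 
sketch:

	__i915_request_commit(rq);
	__i915_request_queue(rq, NULL);			/* queue the idle-switch */
	__intel_wakeref_defer_park(&engine->wakeref);	/* release the exclusive hold */
	intel_timeline_enter(ce->timeline);		/* expose to intel_gt_retire_requests() */
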
The wakeref defer extends the engine lifetime until the submitted rq is 
retired. But how is that considered "unlocking"?

Regards,

Tvrtko

>   
>   	result = false;
>   out_unlock:
> -	__timeline_mark_unlock(engine->kernel_context, flags);
> +	__timeline_mark_unlock(ce, flags);
>   	return result;
>   }
>   
> 

