[Intel-gfx] [PATCH 3/9] drm/i915/gt: Unlock engine-pm after queuing the kernel context switch

Wed Nov 20 13:19:04 UTC 2019

666
On 20/11/2019 12:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-20 12:40:13)
>>
>> On 20/11/2019 12:07, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-11-20 11:58:27)
>>>>
>>>> On 20/11/2019 09:32, Chris Wilson wrote:
>>>>> In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
>>>>> the backend"), I erroneously concluded that we last modify the engine
>>>>> inside __i915_request_commit() meaning that we could enable concurrent
>>>>> submission for userspace as we enqueued this request. However, this
>>>>> falls into a trap with other users of the engine->kernel_context waking
>>>>> up and submitting their request before the idle-switch is queued, with
>>>>> the result that the kernel_context is executed out-of-sequence most
>>>>> likely upsetting the GPU and certainly ourselves when we try to retire
>>>>> the out-of-sequence requests.
>>>>>
>>>>> As such we need to hold onto the effective engine->kernel_context mutex
>>>>> lock (via the engine pm mutex proxy) until we have finish queuing the
>>>>> request to the engine.
>>>>>
>>>>> v2: Serialise against concurrent intel_gt_retire_requests()
>>>>> v3: Describe the hairy locking scheme with intel_gt_retire_requests()
>>>>> for future reference.
>>>>> v4: Combine timeline->lock and engine pm release; it's hairy.
>>>>>
>>>>> Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the backend")
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/intel_engine_pm.c | 47 +++++++++++++++++++----
>>>>>     1 file changed, 40 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> index 3c0f490ff2c7..1f517357a268 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> @@ -73,8 +73,25 @@ static inline void __timeline_mark_unlock(struct intel_context *ce,
>>>>>     
>>>>>     #endif /* !IS_ENABLED(CONFIG_LOCKDEP) */
>>>>>     
>>>>> +static void
>>>>> +__intel_timeline_enter_and_pm_release(struct intel_timeline *tl,
>>>>> +                                   struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     struct intel_gt_timelines *timelines = &engine->gt->timelines;
>>>>> +
>>>>> +     spin_lock(&timelines->lock);
>>>>> +
>>>>> +     if (!atomic_fetch_inc(&tl->active_count))
>>>>> +             list_add_tail(&tl->link, &timelines->active_list);
>>>>
>>>> Hmm with these new part it maybe matches/answers my question from
>>>> "drm/i915/gt: Close race between engine_park and
>>>> intel_gt_retire_requests". I think at least. Since it now adds a second
>>>> place timeline can enter the active_list.
>>>>
>>>> But no, where does the intel_timeline_enter race come from? Can't be
>>>> userspace submission since they are blocked on wf->lock.
>>>
>>> The race is not just with intel_timeline_enter, but with
>>> intel_gt_retire_requests.
>>>
>>> As soon as we are on that list, we may be retired. If we are retired
>>> before adjusting the engine->wakeref.count, we are b0rked.
>>>
>>> As soon as we adjust the engine->wakeref.count, another submission may
>>> call intel_timeline_enter, and again may even retire this request. The
>>> enter itself is perfectly fine, but we need to serialise against the
>>> retires.
>>
>> I think the two patches combined work, I am just not sure two
>> atomic_fetch_inc()->list_add() are needed now that you re-ordered
>> __i915_requests_queue and __intel_wakeref_defer_park - that's the part
>> which is confusing me. But it also doesn't harm...
> 
> I tried to get away with not, but the selftests hammer very heavily on
> the engine->kernel_context so we do encounter the scenarios where we are
> using the kernel_context to park on one cpu while submitting a new
> request on another.
> 
> I would have got away with it but for these pesky selftests.

Okay, I'll trust you on that:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko