[Intel-gfx] [PATCH 29/32] drm/i915: Apply an execution_mask to the virtual_engine
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Apr 17 13:32:03 UTC 2019
On 17/04/2019 13:46, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-04-17 13:35:29)
>>
>> On 17/04/2019 12:57, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-04-17 12:43:49)
>>>>
>>>> On 17/04/2019 08:56, Chris Wilson wrote:
>>>>> Allow the user to direct which physical engines of the virtual engine
>>>>> they wish to execute on, as sometimes it is necessary to override the
>>>>> load balancing algorithm.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> ---
>>>>> drivers/gpu/drm/i915/gt/intel_lrc.c | 58 +++++++++++
>>>>> drivers/gpu/drm/i915/gt/selftest_lrc.c | 131 +++++++++++++++++++++++++
>>>>> drivers/gpu/drm/i915/i915_request.c | 1 +
>>>>> drivers/gpu/drm/i915/i915_request.h | 3 +
>>>>> 4 files changed, 193 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> index d6efd6aa67cb..560a18bb4cbb 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>>> @@ -552,6 +552,18 @@ execlists_context_schedule_out(struct i915_request *rq, unsigned long status)
>>>>> intel_engine_context_out(rq->engine);
>>>>> execlists_context_status_change(rq, status);
>>>>> trace_i915_request_out(rq);
>>>>> +
>>>>> + /*
>>>>> + * If this is part of a virtual engine, its next request may have
>>>>> + * been blocked waiting for access to the active context. We have
>>>>> + * to kick all the siblings again in case we need to switch (e.g.
>>>>> + * the next request is not runnable on this engine). Hopefully,
>>>>> + * we will already have submitted the next request before the
>>>>> + * tasklet runs and do not need to rebuild each virtual tree
>>>>> + * and kick everyone again.
>>>>> + */
>>>>> + if (rq->engine != rq->hw_context->engine)
>>>>> + tasklet_schedule(&rq->hw_context->engine->execlists.tasklet);
>>>>
>>>> Is this needed only for non-default execution_mask? If so, it would be
>>>> good to limit it to avoid a tasklet storm with plain veng.
>>>
>>> The issue is not just with this rq but the next one. If that has a
>>> restricted mask that prevents it running on this engine, we may have
>>> missed the opportunity to queue it (and so never run it under just the
>>> right circumstances).
>>>
>>> Something like
>>> to_virtual_engine(rq->hw_context->engine)->request->execution_mask & ~rq->execution_mask
>>>
>>> The storm isn't quite so bad; it's only on context-out, and we often do
>>> succeed in keeping it busy. I was just trying to avoid pulling in ve here.
>>
>> What do you mean by the "pulling in ve" bit? Avoiding using
>> to_virtual_engine like in the line you wrote above?
>
> Just laziness hiding behind an excuse of trying not to smear veng too
> widely.
>
>>>>> +
>>>>> + rq = READ_ONCE(ve->request);
>>>>> + if (!rq)
>>>>> + return 0;
>>>>> +
>>>>> + /* The rq is ready for submission; rq->execution_mask is now stable. */
>>>>> + mask = rq->execution_mask;
>>>>> + if (unlikely(!mask)) {
>>>>> + /* Invalid selection, submit to a random engine in error */
>>>>> + i915_request_skip(rq, -ENODEV);
>>>>
>>>> When can this happen? It looks like, if it can happen, we should reject
>>>> it earlier; or if it can't, just assert.
>>>
>>> Many submit fences can end up with an intersection of 0. This is the
>>> convenient point to do the rejection, as with any other asynchronous
>>> error.
>>
>> Which ones are many? Why would we have uAPI which allows setting
>> impossible things where all requests will fail with -ENODEV?
>
> But we are rejecting them in the uAPI, right here. This is the earliest
> point where all the information for a particular execbuf is available
> and we have the means of reporting that back.
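>
> As a hypothetical illustration (made-up bit assignments), two perfectly
> valid constraints can intersect to nothing:
>
> 	intel_engine_mask_t mask = BIT(0) | BIT(1);	/* veng: two engines */
>
> 	mask &= BIT(2);	/* bonded submit fence demands a third */
> 	/* mask == 0: no engine satisfies both constraints */
>
> Each fence is fine on its own; only the combination is impossible, and
> the combination is only known once every fence has signalled, i.e. here.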
In the tasklet? I could just be extra slow today, but could you please
explain how we end up allowing a submission which cannot be rejected any
earlier than in the tasklet? What sequence of events leads to it?
>
>>>>> + mask = ve->siblings[0]->mask;
>>>>> + }
>>>>> +
>>>>> + GEM_TRACE("%s: rq=%llx:%lld, mask=%x, prio=%d\n",
>>>>> + ve->base.name,
>>>>> + rq->fence.context, rq->fence.seqno,
>>>>> + mask, ve->base.execlists.queue_priority_hint);
>>>>> +
>>>>> + return mask;
>>>>> +}
>>>>> +
>>>>> static void virtual_submission_tasklet(unsigned long data)
>>>>> {
>>>>> struct virtual_engine * const ve = (struct virtual_engine *)data;
>>>>> const int prio = ve->base.execlists.queue_priority_hint;
>>>>> + intel_engine_mask_t mask;
>>>>> unsigned int n;
>>>>>
>>>>> + rcu_read_lock();
>>>>> + mask = virtual_submission_mask(ve);
>>>>> + rcu_read_unlock();
>>>>
>>>> What is the RCU for?
>>>
>>> Accessing ve->request. There's nothing stopping another engine from
>>> spotting the ve->request still in its tree, submitting it and it being
>>> retired all during the read here.
>>
>> AFAIU there can only be one instance of virtual_submission_tasklet per
>> VE at a time, and the code above runs before the request is inserted
>> into the physical engine trees. So I don't get it.
>
> But the veng is being utilized by real engines concurrently; they are
> the ones who take the ve->request and execute it, and so may free the
> ve->request behind the submission tasklet's back. Later on the spinlock
> comes into play, after we have decided there's a request ready.
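>
> Schematically, the interleaving I am worried about (retirement path
> abbreviated):
>
> 	virtual tasklet				sibling engine
> 	rq = READ_ONCE(ve->request);
> 						takes ve->request, submits it
> 						retires and frees the request
> 	mask = rq->execution_mask;	<- use-after-free without RCU
>
> The rcu_read_lock() keeps the request's memory valid while we peek at
> its execution_mask, even if a sibling retires it under us.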
How can real engines see this request at this point, since it hasn't
been put in the queue yet?
And if it is protecting against the tasklet, then it should be
local_bh_disable/enable. But wait... it is a tasklet already, so that
also doesn't make sense.
So I just don't see it.
I guess it is related to the question of the zero intersected mask. If
that were impossible, you would be able to fetch the mask from inside
the locked section in the hunk one down.
>
>> Hm.. but going back to the veng patch, there is a
>> GEM_BUG_ON(ve->request) in virtual_submit_request. Why couldn't this be
>> called multiple times in parallel for different requests?
>
> Because we strictly ordered submission into the veng so that it only
> considers one ready request at a time. Processing more requests
> decreased global throughput, as load-balancing was no longer "late"
> (the physical engines would then amalgamate all the ve requests into
> one submit).
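>
> Roughly, from virtual_submit_request() in the veng patch (trimmed):
>
> 	GEM_BUG_ON(ve->request);	/* one ready request at a time */
>
> 	ve->base.execlists.queue_priority_hint = rq_prio(rq);
> 	WRITE_ONCE(ve->request, rq);
>
> 	tasklet_schedule(&ve->base.execlists.tasklet);
>
> The GEM_BUG_ON documents that ordering: a second ready request is only
> submitted here after a sibling has taken the first off ve->request.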
I got temporarily confused into thinking submit_notify is at the
queued->runnable transition. You see what you are dealing with here. :I
Regards,
Tvrtko