[Intel-gfx] [PATCH v3 11/14] HACK drm/i915/scheduler: emulate a scheduler for guc

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Dec 1 12:45:18 UTC 2016


On 01/12/2016 11:18, Chris Wilson wrote:
> On Thu, Dec 01, 2016 at 10:45:51AM +0000, Tvrtko Ursulin wrote:
>>
>> On 14/11/2016 08:57, Chris Wilson wrote:
>>> This emulates execlists on top of the GuC in order to defer submission of
>>> requests to the hardware. This deferral allows time for high priority
>>> requests to gazump their way to the head of the queue, however it nerfs
>>> the GuC by converting it back into a simple execlist (where the CPU has
>>> to wake up after every request to feed new commands into the GuC).
>>
>> As it is starting to sink in that we'll have to add this hack sooner
>> or later, review comments below.
>>
>> Also, would you be OK to rebase this or would prefer to delegate it?
>
> It's been very easy to keep it current.
>
>>> @@ -665,6 +665,70 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
>>> 	spin_unlock(&client->wq_lock);
>>> }
>>
>> This confused me at first, until I noticed engine->submit_request
>> will become execlist_submit_request later. Perhaps it would be good
>> to rename a number of things now, e.g. to engine->request_queue,
>> intel_engine_submit_request and maybe more?
>
> Looking at the lifecycle of the request and getting clear names for the
> phases and their vfuncs would be sensible. It may also be worthwhile to
> decide that some are not engine tasks and belong to a new entity
> (dev_priv->gt.X ?)

Like submit_request? Makes sense yes.
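
Something along these lines is what I had in mind, purely as a naming
sketch (engine->request_queue and intel_engine_submit_request are the
made-up names from my comment above, not anything in the patch):

	struct intel_engine_cs {
		/* ...existing members... */

		/*
		 * Backend hook that queues a ready request for submission
		 * (may defer to a tasklet, as execlists and this GuC
		 * emulation do).
		 */
		void (*request_queue)(struct drm_i915_gem_request *rq);
	};

	static inline void
	intel_engine_submit_request(struct drm_i915_gem_request *rq)
	{
		rq->engine->request_queue(rq);
	}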

>>> +static bool i915_guc_dequeue(struct intel_engine_cs *engine)
>>> +{
>>> +	struct execlist_port *port = engine->execlist_port;
>>> +	struct drm_i915_gem_request *last = port[0].request;
>>> +	unsigned long flags;
>>> +	struct rb_node *rb;
>>> +	bool submit = false;
>>> +
>>> +	spin_lock_irqsave(&engine->timeline->lock, flags);
>>> +	rb = engine->execlist_first;
>>> +	while (rb) {
>>> +		struct drm_i915_gem_request *cursor =
>>> +			rb_entry(rb, typeof(*cursor), priotree.node);
>>> +
>>> +		if (last && cursor->ctx != last->ctx) {
>>
>> Not sure if GVT comes into the picture here, but it does not sound
>> like it would do any harm to use can_merge_ctx here?
>
> I wasn't sure what path GVT would take either. So just went with the
> simple version that looked as similar to the current guc submission as
> possible. Also offloading the scheduling to the guc via semaphores will
> likely make this whole chain look completely different.

Hmm I am not up to speed with that. So you are saying it doesn't make 
sense to unify this?
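
For reference, what I meant was roughly this in the dequeue loop, as a
sketch and assuming the execlists helper could be shared (or copied)
as-is:

	if (last && !can_merge_ctx(cursor->ctx, last->ctx)) {
		if (port != engine->execlist_port)
			break;
		/* ...second port handling as in the patch... */
	}

so that any GVT single-submission restriction would then apply to the
GuC path automatically.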

>>> +			if (port != engine->execlist_port)
>>> +				break;
>>
>> It may be overkill for the first version, but I was thinking that
>> we don't have to limit it to two at a time. And it would depend on
>> measuring of course. But perhaps it would make sense to generalise
>> the number of supported ports straight away.
>
> Definitely. I was just looking at a minimal conversion, hence reusing
> the existing tracking and limits.

Definitely leave it for later, or definitely it makes sense to 
generalise right now? I was just thinking that when someone goes to test 
this and finds the throughput regresses, it might be easier to just say 
please try i915.guc_submit_ports=8 or something.
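
To illustrate, I was imagining something as simple as this (hypothetical
sketch only, no such parameter exists today):

	#include <linux/moduleparam.h>

	/* How many requests to keep in flight per engine via the GuC. */
	static unsigned int guc_submit_ports = 2;
	module_param(guc_submit_ports, uint, 0400);
	MODULE_PARM_DESC(guc_submit_ports,
			 "Number of GuC submission ports to fill per dequeue (default: 2)");

with the dequeue loop then stopping at engine->execlist_port +
guc_submit_ports instead of the hard-coded pair.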

>> We could theoretically share most of the execlist_dequeue and just
>> do a couple things differently depending on the mode.
>
> Yes, there were just a couple of intrusive warts that I felt justified
> in keeping the routines separate. But mostly it was the merge_ctx
> decision that looked to be backend specific.

Cool.

>> Looks like one could be a new engine->submit_ports vfunc. And there
>> is also the lite restore WA and SW signalling to design in nicely,
>> but it may be worth sharing the code. It would be renamed to
>> something like scheduler_dequeue.
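
To expand on the vfunc idea a bit (sketch only, names invented here):

	/* in struct intel_engine_cs */
	void (*submit_ports)(struct intel_engine_cs *engine);

	/*
	 * at the end of the shared dequeue loop, instead of writing ELSP
	 * directly (execlists) or ringing the doorbell (GuC):
	 */
	if (submit)
		engine->submit_ports(engine);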
>
>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>> index cb8a75f6ca16..18dce4c66d56 100644
>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>> @@ -1341,8 +1341,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
>>> static __always_inline void
>>> gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>>> {
>>> -	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
>>> +	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
>>> +		tasklet_schedule(&engine->irq_tasklet);
>>
>> This would be better made conditional on GuC submission, just to
>> avoid calling tasklet_schedule twice (occasionally) in execlist mode.
>
> It's not a huge cost if we do schedule the tasklet twice (in the same
> interrupt context at least). Otherwise, yes we are incurring more mmio
> reads to determine that the CSB didn't advance. On the other hand, I
> really don't want to have if (i915.enable_guc_submission) here. Maybe
> if (engine->user_irq_tasklet) tasklet_schedule(engine->user_irq_tasklet)
> ?

Hm, I don't know. One conditional plus a redundant engine struct member 
versus just one conditional. Maybe dev_priv->gt.handle_user_interrupt and 
dev_priv->gt.handle_ctx_switch_interrupt? You won't like the indirection, 
I am sure. Go with the user_irq_tasklet then.
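
So the handler would end up along these lines (sketch, with
user_irq_tasklet being the new pointer that is only set when GuC
submission is enabled):

	static __always_inline void
	gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
	{
		if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
			if (engine->user_irq_tasklet)
				tasklet_schedule(engine->user_irq_tasklet);
			notify_ring(engine);
		}

		if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
			tasklet_schedule(&engine->irq_tasklet);
	}

which leaves the execlists interrupt path as it is today.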

> Speaking of CSB, the spec keeps hinting that "CSB occupies a whole
> cacheline, making the read cheap". I'm not aware of how we can do a
> whole-cacheline read from UC in a single transfer. Ideas? movntdqa I
> thought was specifically WC.

Unfortunately I don't know. Low-level x86 is not my strong area. For 
example, I only learnt of movntdqa when you and Akash started using it.
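
As far as I understood it, the WC path boils down to something like the
below (userspace-style sketch with the SSE4.1 intrinsic, so take it with
a grain of salt), and as you say that is specific to WC, so it does not
answer the UC question:

	#include <smmintrin.h>	/* SSE4.1: movntdqa */

	/*
	 * Copy one 64-byte line out of a WC mapping using four 16-byte
	 * streaming loads; both pointers assumed 16-byte aligned.
	 */
	static void copy_line_from_wc(void *dst, void *src)
	{
		__m128i *d = dst;
		__m128i *s = src;

		d[0] = _mm_stream_load_si128(&s[0]);
		d[1] = _mm_stream_load_si128(&s[1]);
		d[2] = _mm_stream_load_si128(&s[2]);
		d[3] = _mm_stream_load_si128(&s[3]);
	}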

Regards,

Tvrtko

