[Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the ctx switch interrupt handler

Wed Jun 11 16:26:01 CEST 2014

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Wednesday, June 11, 2014 2:58 PM
> To: Mateo Lozano, Oscar
> Cc: Daniel Vetter; intel-gfx at lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the ctx switch
> interrupt handler
> 
> On Wed, Jun 11, 2014 at 12:01:42PM +0000, Mateo Lozano, Oscar wrote:
> > > -----Original Message-----
> > > From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of
> > > Daniel Vetter
> > > Sent: Wednesday, June 11, 2014 12:50 PM
> > > To: Mateo Lozano, Oscar
> > > Cc: intel-gfx at lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the
> > > ctx switch interrupt handler
> > >
> > > On Fri, May 09, 2014 at 01:09:19PM +0100, oscar.mateo at intel.com
> wrote:
> > > > From: Oscar Mateo <oscar.mateo at intel.com>
> > > >
> > > > If we receive a storm of requests for the same context (see
> > > > gem_storedw_loop_*) we might end up iterating over too many
> > > > elements in interrupt time, looking for contexts to squash
> > > > together. Instead, share the burden by giving more intelligence to
> > > > the queue function. At most, the interrupt will iterate over three
> elements.
> > > >
> > > > Signed-off-by: Oscar Mateo <oscar.mateo at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
> > > >  1 file changed, 20 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > > > b/drivers/gpu/drm/i915/intel_lrc.c
> > > > index d9edd10..0aad721 100644
> > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > @@ -410,9 +410,11 @@ int gen8_switch_context_queue(struct
> > > intel_engine *ring,
> > > >  			      struct i915_hw_context *to,
> > > >  			      u32 tail)
> > > >  {
> > > > +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > > >  	struct drm_i915_gem_request *req = NULL;
> > > >  	unsigned long flags;
> > > > -	bool was_empty;
> > > > +	struct drm_i915_gem_request *cursor;
> > > > +	int num_elements = 0;
> > > >
> > > >  	req = kzalloc(sizeof(*req), GFP_KERNEL);
> > > >  	if (req == NULL)
> > > > @@ -425,9 +427,24 @@ int gen8_switch_context_queue(struct
> > > intel_engine
> > > > *ring,
> > > >
> > > >  	spin_lock_irqsave(&ring->execlist_lock, flags);
> > > >
> > > > -	was_empty = list_empty(&ring->execlist_queue);
> > > > +	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
> > > > +		if (++num_elements > 2)
> > > > +			break;
> > > > +
> > > > +	if (num_elements > 2) {
> > > > +		struct drm_i915_gem_request *tail_req =
> > > > +				list_last_entry(&ring->execlist_queue,
> > > > +					struct drm_i915_gem_request,
> > > execlist_link);
> > > > +		if (to == tail_req->ctx) {
> > > > +			WARN(tail_req->elsp_submitted != 0,
> > > > +					"More than 2 already-submitted reqs
> > > queued\n");
> > > > +			list_del(&tail_req->execlist_link);
> > > > +			queue_work(dev_priv->wq, &tail_req->work);
> > > > +		}
> > > > +	}
> > >
> > > Completely forgotten to mention this: Chris&I discussed this on irc
> > > and I guess this issue will disappear if we track contexts instead
> > > of requests in the scheduler. I guess this is an artifact of the
> > > gen7 scheduler you've based this on, but even for that I think
> > > scheduling contexts (with preempt point after each batch) is the
> > > right approach. But I haven't dug out the scheduler patches again so
> might be wrong with that.
> > > -Daniel
> >
> > Hmmmm... I didn´t really base this on the scheduler. Some kind of
> > queue to hold context submissions until the hardware was ready was
> > needed, and queuing drm_i915_gem_requests seemed like a good choice
> at
> > the time (by the way, in the next version I am using a new struct
> > intel_ctx_submit_request, since I don´t need most of the fields in
> > drm_i915_gem_requests, and I have to add a couple of new ones anyway).
> >
> > What do you mean by "scheduling contexts"? Notice that the requests I
> > am queuing basically just contain the context and the tail at the
> > point it was submitted for execution...
> 
> Well I've thought we could just throw contexts at the hardware and throw
> new ones at it when the old ones get stuck/are completed. But now I've
> realized that since we do the cross-engine/ctx depency tracking in software
> it's not quite that easy and we can't unconditionally update the tail-pointer.

Exactly: unconditionally updating the tail-pointer also means the seqnos might get executed out of order, which is not nice (at least until there is a scheduler keeping track of the dependencies).

> Still for the degenerate case of one ctx submitting batches exclusively I've
> hoped just updating the tail pointer in the context and telling the hw to
> reload the current context should have been enough. Or at least I've hoped
> so, and that should take (mostly) care of the insane request overload case
> your patch here addresses.

What we had before this patch:

A) The submitter places one request per new batch received in the queue, regardless.
B) The interrupt handler traverses the queue backwards, squashing together all requests for the same context that come in a row. Notice that the first one in the queue might be already in execution, in which case the squashing will end up doing exactly what you said: updating the tail pointer.
C) When the ELSP is written to, if the same context was already in execution, the GPU performs a Lite Restore (the new tail is sampled and execution continues).

After this patch:

A) The submitter places one request per new batch received in the queue. But, if the last request in the queue belongs to the same context, and it´s not suspicious of being in-execution (which means it occupies the 3rd position or higher) then squash the two requests together (the old and the new).
B) The interrupt handler traverses the queue backwards, squashing together all requests for the same context that come in a row (maximum of two: one in execution, and the pending one).
C) When the ELSP is written to, if the same context was already in execution, the GPU performs a Lite Restore (the new tail is sampled and execution continues).

In a graphical form, if A and B are contexts and (*) means they are in execution:

B3 -> B2 B1* A1* (the submitter wants to queue B with tail 3)
B3 B1* A1*(as B2 is redundant, the submitter drops it and queues B3 in place)
B3 B1* (the interrupt informs us that A is complete, so we remove it from the head of the queue)
-> B3 (since B1 is redundant, we submit B3, causing a Lite Restore)

(* context in execution)