[Intel-gfx] [PATCH v4 00/38] GPU scheduler for i915 driver

Mon Jan 11 15:52:52 PST 2016

On Mon, Jan 11, 2016 at 06:42:29PM +0000, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
> 
> Implemented a batch buffer submission scheduler for the i915 DRM driver.

I've lost track of the number of patches that are a result of not having
per-context seqno and could be eliminated.

> The general theory of operation is that when batch buffers are
> submitted to the driver, the execbuffer() code assigns a unique seqno
> value and then packages up all the information required to execute the
> batch buffer at a later time. This package is given over to the
> scheduler which adds it to an internal node list. The scheduler also
> scans the list of objects associated with the batch buffer and
> compares them against the objects already in use by other buffers in
> the node list. If matches are found then the new batch buffer node is
> marked as being dependent upon the matching node. The same is done for
> the context object. The scheduler also bumps up the priority of such
> matching nodes on the grounds that the more dependencies a given batch
> buffer has the more important it is likely to be.
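
As a rough illustration of the matching step described in the quoted text, here is a minimal sketch. It is not taken from the patches: struct sched_node, sched_track_deps(), the fixed-size arrays and the bump parameter are all hypothetical names chosen for the example. A new node becomes dependent on every queued node that shares an object or the context with it, and each such node has its priority bumped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJS 8
#define MAX_DEPS 8

/* Hypothetical node layout; not the structures used by the patch series. */
struct sched_node {
	unsigned int objs[MAX_OBJS];       /* handles of objects the batch uses */
	size_t num_objs;
	const void *ctx;                   /* owning context */
	int priority;
	struct sched_node *deps[MAX_DEPS]; /* nodes this one must wait for */
	size_t num_deps;
};

/* True if the two nodes reference at least one common object. */
static bool shares_object(const struct sched_node *a,
			  const struct sched_node *b)
{
	for (size_t i = 0; i < a->num_objs; i++)
		for (size_t j = 0; j < b->num_objs; j++)
			if (a->objs[i] == b->objs[j])
				return true;
	return false;
}

/*
 * Record a dependency of 'node' on every entry of queued[0..n) that shares
 * an object or the context with it, and raise that entry's priority by
 * 'bump'. Returns the number of dependencies added.
 */
static size_t sched_track_deps(struct sched_node *node,
			       struct sched_node **queued, size_t n, int bump)
{
	size_t added = 0;

	for (size_t i = 0; i < n; i++) {
		struct sched_node *other = queued[i];

		if (other->ctx != node->ctx && !shares_object(node, other))
			continue;

		assert(node->num_deps < MAX_DEPS);
		node->deps[node->num_deps++] = other;
		other->priority += bump;
		added++;
	}
	return added;
}
```

The nested loop is quadratic in the number of objects, which is acceptable for a sketch but is one reason to reuse whatever tracking already exists rather than rescanning object lists.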

The implicit synchronisation rules for GEM are best left for GEM.
Through the existing mechanism for synchronising requests, you can also
gather the information required to compute the dependency graph of the
new request. Adding the explicit synchronisation can then be done at the
same juncture.

> The scheduler aims to have a given (tuneable) number of batch buffers
> in flight on the hardware at any given time. If fewer than this are
> currently executing when a new node is queued, then the node is passed
> straight through to the submit function. Otherwise it is simply added
> to the queue and the driver returns back to user land.
> 
> As each batch buffer completes, it raises an interrupt which wakes up
> the scheduler. Note that it is possible for multiple buffers to
> complete before the IRQ handler gets to run. Further, the seqno values
> of the individual buffers are not necessarily incrementing as the
> scheduler may have re-ordered their submission. However, the scheduler
> keeps the list of executing buffers in order of hardware submission.
> Thus it can scan through the list until a matching seqno is found and
> then mark all in flight nodes from that point on as completed.
> 
> A deferred work queue is also poked by the interrupt handler. When
> this wakes up it can do more involved processing such as actually
> removing completed nodes from the queue and freeing up the resources
> associated with them (internal memory allocations, DRM object
> references, context reference, etc.). The work handler also checks the
> in flight count and calls the submission code if a new slot has
> appeared.
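
To make the completion scan in the quoted text concrete, here is a small sketch under stated assumptions. The names (struct inflight_node, sched_mark_completed()) are hypothetical, and the in-flight list is modelled as an array ordered oldest submission first, so a match at index k implies everything at or before k has also finished. The patches may order or store the list differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight entry; seqnos need not be monotonic in this list. */
struct inflight_node {
	uint32_t seqno;
	bool completed;
};

/*
 * inflight[] is assumed to be ordered by hardware submission, oldest first.
 * Find the entry whose seqno the hardware last reported and mark it and
 * every earlier entry as completed. Returns the number of entries newly
 * marked, or 0 if the seqno is not in the list.
 */
static size_t sched_mark_completed(struct inflight_node *inflight, size_t n,
				   uint32_t hw_seqno)
{
	size_t match, done = 0;

	assert(inflight != NULL || n == 0);

	for (match = 0; match < n; match++)
		if (inflight[match].seqno == hw_seqno)
			break;
	if (match == n)
		return 0;

	for (size_t i = 0; i <= match; i++) {
		if (!inflight[i].completed) {
			inflight[i].completed = true;
			done++;
		}
	}
	return done;
}
```

Because the scan matches on equality rather than comparing seqno magnitudes, it still works when the scheduler has reordered submissions.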

No. Leave GEM code to GEM. Just handle scheduling of requests, avoid the
struct_mutex and let GEM tidy up after the requests it is tracking. Create
a kthread for the scheduler; long-running tasks are not meant to be on the
system_wq. A kthread also allows you to set a realtime priority.

> When the scheduler's submit code is called, it scans the queued node
> list for the highest priority node that has no unmet dependencies.
> Note that the dependency calculation is complex as it must take
> inter-ring dependencies and potential preemptions into account. Note
> also that in the future this will be extended to include external
> dependencies such as the Android Native Sync file descriptors and/or
> the Linux dma-buf synchronisation scheme.

(You can skip the note since it is just checking if a dependency is a
struct fence and whether that has been signalled; that is not any more
complex than the current request checking.)
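
A sketch of that selection, assuming each dependency reduces to a "has it signalled" check. The types here (struct dep standing in for a fence, struct queued_node) and sched_pick_next() are hypothetical illustrations, not the kernel's struct fence API or the code in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Stand-in for a fence: all the scheduler needs is "has it signalled?". */
struct dep {
	bool signalled;
};

/* Hypothetical queued entry. */
struct queued_node {
	int priority;
	const struct dep *deps[MAX_DEPS];
	size_t num_deps;
};

/* A node is ready once every dependency has signalled. */
static bool node_ready(const struct queued_node *node)
{
	assert(node->num_deps <= MAX_DEPS);

	for (size_t i = 0; i < node->num_deps; i++)
		if (!node->deps[i]->signalled)
			return false;
	return true;
}

/*
 * Return the highest-priority ready node in queue[0..n), or NULL if none
 * is ready. Ties go to the entry that appears first in the queue.
 */
static struct queued_node *sched_pick_next(struct queued_node *queue, size_t n)
{
	struct queued_node *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!node_ready(&queue[i]))
			continue;
		if (!best || queue[i].priority > best->priority)
			best = &queue[i];
	}
	return best;
}
```

With dependencies expressed this way, inter-ring and external dependencies need no special casing in the picker; they are just more entries in deps[].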

> If a suitable node is found then it is sent to execbuff_final() for
> submission to the hardware. The in flight count is then re-checked and
> a new node popped from the list if appropriate.

That was the wrong callback to break up. You just wanted an
engine->submit_request(). But then if you look at execlists, you will
see a way to marry the two such that the scheduler has negligible overhead
above and beyond the already considerable overhead of execlists. With
legacy, you will have to introduce the cost of interrupt driven
scheduling, but you can borrow an idea or two from execlists to mitigate
that somewhat (i.e. context switch interrupts rather than reusing the
user interrupt after every batch).

> The scheduler also allows high priority batch buffers (e.g. from a
> desktop compositor) to jump ahead of whatever is already running if
> the underlying hardware supports pre-emption. In this situation, any
> work that was pre-empted is returned to the queued list ready to be
> resubmitted when no more high priority work is outstanding.

You could actually demonstrate that in execlists without adding a full
blown scheduler.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre