[Intel-gfx] [PATCH v6 00/34] GPU scheduler for i915 driver

John Harrison John.C.Harrison at Intel.com
Mon Apr 25 11:55:22 UTC 2016


On 25/04/2016 10:54, Chris Wilson wrote:
> On Wed, Apr 20, 2016 at 06:13:18PM +0100, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> Implemented a batch buffer submission scheduler for the i915 DRM driver.
>>
>> The general theory of operation is that when batch buffers are
>> submitted to the driver, the execbuffer() code assigns a unique seqno
>> value and then packages up all the information required to execute the
>> batch buffer at a later time. This package is given over to the
>> scheduler which adds it to an internal node list.
> That is called the request. Please use it properly. We pass a request to
> the engine. The engine can hook up the scheduler in which case it may
> defer execution of that request until the scheduler says it is ready.
> But that is a conversation inside the engine, not intermingled with GEM.
It looks like I forgot to update that part of the description. The 
scheduler and its cover letter were written before the conversion of the 
driver from using seqnos everywhere to using requests. Indeed, that 
conversion was implemented as prep-work for the scheduler. So yes, the 
description should be updated to match the code and say 'assigns a 
request structure' rather than a seqno.
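
To make the corrected description concrete, the hand-over from 
execbuffer() to the scheduler is conceptually along these lines (a rough 
sketch only; the structure and function names here are illustrative, not 
the actual definitions from the series):

/* Sketch only: illustrative types, not the real i915 definitions. */
#include <linux/list.h>
#include <linux/spinlock.h>

struct i915_scheduler {
        spinlock_t lock;
        struct list_head queue;         /* nodes waiting to be submitted */
        struct list_head inflight;      /* nodes already on the hardware */
};

struct sched_node {
        struct drm_i915_gem_request *request;   /* all the state needed to run later */
        struct list_head link;                  /* position in queue/inflight */
        struct list_head deps;                  /* nodes this one must wait for */
        int priority;
};

/* execbuffer() builds the request, wraps it in a node and hands it over: */
static void i915_scheduler_queue_node(struct i915_scheduler *sched,
                                      struct sched_node *node)
{
        spin_lock_irq(&sched->lock);
        list_add_tail(&node->link, &sched->queue);
        spin_unlock_irq(&sched->lock);
}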

> GEM needs to only care about its unique seqno along the timeline. We can
> leave space for the scheduler to inject a second seqno if that is
> convenient, but since the scheduler can keep a global list of inflight
> requests, its retirement is also globally ordered even if we only
> inspect each request wrt to its own timeline.
Currently, there is no 'own timeline'. The seqno timeline is global to 
the entire driver rather than per engine or per context. The conversion 
to more useful timelines still needs to be implemented; until then, the 
scheduler has to cope with the single global one.
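
To illustrate, with that single global timeline, "has this request 
completed" currently reduces to one wrap-safe comparison against the one 
hardware counter, something like the sketch below (the helper name is 
made up, and req->seqno stands for whatever seqno the request was 
assigned; assumes the request definition from i915_drv.h):

#include <linux/types.h>
#include "i915_drv.h"

/* Sketch: with one driver-global seqno timeline, every request is judged
 * against the same hardware counter. */
static bool request_complete_global(const struct drm_i915_gem_request *req,
                                    u32 hw_seqno)
{
        /* wrap-safe comparison in the style of i915_seqno_passed() */
        return (s32)(hw_seqno - req->seqno) >= 0;
}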


>
>> The scheduler also
>> scans the list of objects associated with the batch buffer and
>> compares them against the objects already in use by other buffers in
>> the node list.
> This part moves the policy of the GEM API and enshrines it into the
> scheduler. Let GEM convert the implicit rules it has into explicit
> dependencies, on to which we can add the explicit dependencies passed in
> by the user. GEM has the information it needs to maintain its API, let's
> not let that silliness spread.
So how should this be done? The scheduler needs some mechanism for 
ensuring that requests are only submitted to the hardware in a suitable 
order, and that ordering information needs to persist for as long as the 
request remains uncompleted.
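
If GEM is to convert its implicit rules into explicit dependencies as 
you suggest, I would expect it to boil down to attaching something like 
the following to the request before it reaches the scheduler (a sketch 
only; sched_dep and the helper are hypothetical, reusing the 
illustrative sched_node above):

#include <linux/list.h>
#include <linux/slab.h>

/* Sketch: each explicit dependency records which node must complete first,
 * and it lives as long as the waiting node does. */
struct sched_dep {
        struct list_head link;
        struct sched_node *signaler;
};

static int sched_node_add_dep(struct sched_node *waiter,
                              struct sched_node *signaler)
{
        struct sched_dep *dep = kmalloc(sizeof(*dep), GFP_KERNEL);

        if (!dep)
                return -ENOMEM;

        dep->signaler = signaler;
        list_add_tail(&dep->link, &waiter->deps);
        return 0;
}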

>
>> If matches are found then the new batch buffer node is
>> marked as being dependent upon the matching node. The same is done for
>> the context object. The scheduler also bumps up the priority of such
>> matching nodes on the grounds that the more dependencies a given batch
>> buffer has the more important it is likely to be.
> Sounds like a rough cut at PI avoidance. This is easily abusable. It
> sounds like you want a prio/nice split. Is that a good idea to introduce
> in the first series?
I'm not sure how you mean it can be abused; could you elaborate?
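
For reference, the bump itself is nothing more elaborate than 
propagating a priority increment down the dependency chain, roughly as 
in the sketch below (reusing the illustrative types above; the real code 
would also need to guard against cycles and very deep chains):

/* Sketch: when a new node is found to depend on an existing one, the
 * existing node (and everything it in turn depends on) gets a boost so
 * it is picked sooner. */
static void sched_bump_priority(struct sched_node *node, int bump)
{
        struct sched_dep *dep;

        node->priority += bump;
        list_for_each_entry(dep, &node->deps, link)
                sched_bump_priority(dep->signaler, bump);
}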

>> The scheduler aims to have a given (tuneable) number of batch buffers
>> in flight on the hardware at any given time. If fewer than this are
>> currently executing when a new node is queued, then the node is passed
>> straight through to the submit function. Otherwise it is simply added
>> to the queue and the driver returns back to user land.
> Keep it in the scheduler, I see you decided to add EAGAIN to GEM execbuf
> when currently execbuf *blocks* in this situation. Don't use EAGAIN here
> since you just make userspace busyspin over a very, very heavyweight
> ioctl, and you have the power here to throttle userspace without
> blocking.
Not sure where you are looking. The above description does not talk 
about returning EAGAIN at any point. It says the batch buffer is either 
sent directly to the hardware (in the case of idle hardware) or added to 
the scheduler's queue (in the case of busy hardware). Either way, the 
execbuf IOCTL returns success to the user. There is a later patch in the 
series which adds throttling to prevent a user from submitting 
arbitrarily large numbers of slow batches to the scheduler and 
effectively consuming unbounded amounts of memory due to the scheduler's 
queue not being limited by the size of a ring buffer. However, the 
throttling is done by waiting on an outstanding request so as to block 
until the queue depth has been reduced. It only returns an error to the 
user in the case where the wait could not be done for some reason (e.g. 
wedged GPU).
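
In other words, the throttling in that later patch is conceptually the 
following (a sketch only; sched_file_queue_depth() and 
sched_oldest_request() are made-up helpers, and the actual patch should 
be consulted for the real wait call):

#include "i915_drv.h"

/* Sketch: instead of returning -EAGAIN, execbuf blocks on an outstanding
 * request from the same file once the queue depth limit is hit. */
static int sched_throttle(struct i915_scheduler *sched,
                          struct drm_file *file, int max_queued)
{
        while (sched_file_queue_depth(sched, file) >= max_queued) {
                struct drm_i915_gem_request *req;
                int ret;

                req = sched_oldest_request(sched, file);
                if (!req)
                        break;

                ret = i915_wait_request(req);   /* blocks until it completes */
                if (ret)
                        return ret;             /* e.g. wedged GPU */
        }

        return 0;
}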


>> As each batch buffer completes, it raises an interrupt which wakes up
>> the scheduler. Note that it is possible for multiple buffers to
>> complete before the IRQ handler gets to run. Further, the seqno values
>> of the individual buffers are not necessarily incrementing as the
>> scheduler may have re-ordered their submission. However, the scheduler
>> keeps the list of executing buffers in order of hardware submission.
>> Thus it can scan through the list until a matching seqno is found and
>> then mark all in flight nodes from that point on as completed.
> No. You haven't built your timelines correctly. The idea behind the
> timeline is that a request can wait upon its seqno and check it against
> its own timeline, which is ordered only with other requests on its
> timeline. Because of this, it is independent of whatever order the
> scheduler executes the timelines in, each timeline is ordered.
>
> A request can simply wait for its timeline to advance, completely
> ignorant of the scheduler. (Request signaling may be driven by the
> scheduler, but that is a lowlevel, not GEM or dma-buf/fence,
> implementation detail. And only if the scheduler is coupled into the
> user-interrupt, but on execlists it will be using the context-switch
> interrupt to drive itself, and for ringbuffer mode we have a choice of
> user-interrupt or using pipe-control/dw-notify to keep the paths
> separate.)
Again, you are assuming per-context timelines, which the driver does not 
yet have. That is planned work but not yet implemented.
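
For completeness, the completion scan described in the cover letter text 
above amounts to walking the in-flight list (kept in hardware submission 
order) up to the seqno that has just popped out of the hardware, along 
these lines (sketch, reusing the illustrative types from earlier; 
sched_node_retire() is made up):

#include "i915_drv.h"

static void sched_process_completions(struct i915_scheduler *sched,
                                      u32 hw_seqno)
{
        struct sched_node *node, *next;

        list_for_each_entry_safe(node, next, &sched->inflight, link) {
                /* the list is in submission order, so stop at the first
                 * node the hardware has not yet reached */
                if ((s32)(hw_seqno - node->request->seqno) < 0)
                        break;

                list_del(&node->link);
                sched_node_retire(node);
        }
}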

>
>> A deferred work queue is also poked by the interrupt handler.
> Wait you did all that in the interrupt handler. NO.
Actually, that comment is again out of date. Since the seqno -> request 
conversion series, the scheduler no longer has to worry about 
out-of-order seqno values from the hardware, so none of the above IRQ 
mess occurs any more.
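
These days the interrupt handler itself does nothing beyond kicking the 
deferred work; all list walking happens later in process context. 
Roughly (sketch, with completion_work being an illustrative work_struct 
member):

#include <linux/workqueue.h>

/* Sketch: the IRQ path only schedules the deferred work. */
static void sched_notify_ring(struct i915_scheduler *sched)
{
        schedule_work(&sched->completion_work);
}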

>
>> If a suitable node is found then it is sent to execbuff_final() for
>> submission to the hardware.
> That is an absolutely atrocious implementation. Don't pass it back up
> the layers when you already have all the information you need packaged in
> the request to submit it to hardware.
>
> execbuf -> engine->add_request -> [scheduler] -> engine->submit_request
>
> (conceptually, since what should happen is execlists gets it
> context-switch interrupt and then asks what set of requests to submit
> next. In a ringbuffer mode, a kthread/ktasklet would run off the
> interrupt ask which request to execute next and do the write into the
> ring for the switch into the new context and execution of the next batch,
> and then write the RING_TAIL.)
>
> At the fundamental level it looks like you have not introduced timelines
> correctly or introduced the scheduler as a separate entity for deciding
> which request to execute next (or if this request should preempt execution).
When this code was first written, there was not even the possibility of 
implementing per-context timelines. Since that first version, there has 
been a lot of complex prep work (the anti-OLR series, the 
seqno-to-request conversion, the struct fence conversion), all of which 
makes the scheduler series a lot simpler. During that process there has 
been ample opportunity to comment on the direction it was taking and to 
explain what could be improved and how.
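
Just so we are talking about the same thing, my understanding of the 
flow being suggested is roughly the following (sketch only; 
engine->scheduler, sched_queue_request() and sched_pop_ready() are 
hypothetical, and add_request/submit_request are just the hook names 
from the flow quoted above):

#include "i915_drv.h"

/* Sketch of the suggested hook points: execbuf hands the request to the
 * engine, the engine defers to the scheduler if one is attached, and the
 * scheduler later pushes ready requests back down to the engine. */
static int engine_add_request(struct intel_engine_cs *engine,
                              struct drm_i915_gem_request *req)
{
        if (engine->scheduler)
                return sched_queue_request(engine->scheduler, req);

        return engine->submit_request(req);
}

/* Driven by, e.g., the context-switch interrupt on execlists, or a
 * kthread/tasklet in ringbuffer mode: */
static void sched_dispatch_ready(struct i915_scheduler *sched)
{
        struct sched_node *node;

        while ((node = sched_pop_ready(sched)) != NULL)
                node->request->engine->submit_request(node->request);
}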

The agreement on per-context timelines is that, while they would make 
things simpler still, implementing them first would mean yet more delay, 
so they can be left until after the current scheduler series has landed. 
Otherwise it becomes a never-ending task of rebasing the scheduler 
patches onto a rapidly changing target, let alone writing entire extra 
patch series and reworking the scheduler accordingly. Plus, we have 
multiple customers that require the scheduler functionality (and the 
pre-emption support that builds on top of it), so we need to get 
something shipping now rather than in another X years' time.

> -Chris
>


