[Intel-gfx] [PATCH v5 06/35] drm/i915: Start of GPU scheduler

Fri Feb 19 17:03:37 UTC 2016

On 19/02/2016 13:03, Joonas Lahtinen wrote:
> Hi,
>
> Now the code is in reviewable chunks, excellent!
>
> I've added my comments below. A few repeats from last round, but now
> with more questions about the logic itself.
>
> On to, 2016-02-18 at 14:26 +0000, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> Initial creation of scheduler source files. Note that this patch
>> implements most of the scheduler functionality but does not hook it in
>> to the driver yet. It also leaves the scheduler code in 'pass through'
>> mode so that even when it is hooked in, it will not actually do very
>> much. This allows the hooks to be added one at a time in bite size
>> chunks and only when the scheduler is finally enabled at the end does
>> anything start happening.
>>
>> The general theory of operation is that when batch buffers are
>> submitted to the driver, the execbuffer() code packages up all the
>> information required to execute the batch buffer at a later time. This
>> package is given over to the scheduler which adds it to an internal
>> node list. The scheduler also scans the list of objects associated
>> with the batch buffer and compares them against the objects already in
>> use by other buffers in the node list. If matches are found then the
>> new batch buffer node is marked as being dependent upon the matching
>> node. The same is done for the context object. The scheduler also
>> bumps up the priority of such matching nodes on the grounds that the
>> more dependencies a given batch buffer has the more important it is
>> likely to be.
>>
>> The scheduler aims to have a given (tuneable) number of batch buffers
>> in flight on the hardware at any given time. If fewer than this are
>> currently executing when a new node is queued, then the node is passed
>> straight through to the submit function. Otherwise it is simply added
>> to the queue and the driver returns back to user land.
>>
>> The scheduler is notified when each batch buffer completes and updates
>> its internal tracking accordingly. At the end of the completion
>> interrupt processing, if any scheduler tracked batches were processed,
>> the scheduler's deferred worker thread is woken up. This can do more
>> involved processing such as actually removing completed nodes from the
>> queue and freeing up the resources associated with them (internal
>> memory allocations, DRM object references, context reference, etc.).
>> The work handler also checks the in flight count and calls the
>> submission code if a new slot has appeared.
>>
>> When the scheduler's submit code is called, it scans the queued node
>> list for the highest priority node that has no unmet dependencies.
>> Note that the dependency calculation is complex as it must take
>> inter-ring dependencies and potential preemptions into account. Note
>> also that in the future this will be extended to include external
>> dependencies such as the Android Native Sync file descriptors and/or
>> the linux dma-buff synchronisation scheme.
>>
>> If a suitable node is found then it is sent to execbuff_final() for
>> submission to the hardware. The in flight count is then re-checked and
>> a new node popped from the list if appropriate. All nodes that are not
>> submitted have their priority bumped. This ensures that low priority
>> tasks do not get starved out by busy higher priority ones - everything
>> will eventually get its turn to run.
>>
>> Note that this patch does not implement pre-emptive scheduling. Only
>> basic scheduling by re-ordering batch buffer submission is currently
>> implemented. Pre-emption of actively executing batch buffers comes in
>> the next patch series.
>>
>> v2: Changed priority levels to +/-1023 due to feedback from Chris
>> Wilson.
>>
>> Removed redundant index from scheduler node.
>>
>> Changed time stamps to use jiffies instead of raw monotonic. This
>> provides lower resolution but improved compatibility with other i915
>> code.
>>
>> Major re-write of completion tracking code due to struct fence
>> conversion. The scheduler no longer has it's own private IRQ handler
>> but just lets the existing request code handle completion events.
>> Instead, the scheduler now hooks into the request notify code to be
>> told when a request has completed.
>>
>> Reduced driver mutex locking scope. Removal of scheduler nodes no
>> longer grabs the mutex lock.
>>
>> v3: Refactor of dependency generation to make the code more readable.
>> Also added in read-read optimisation support - i.e., don't treat a
>> shared read-only buffer as being a dependency.
>>
>> Allowed the killing of queued nodes rather than only flying ones.
>>
>> v4: Updated the commit message to better reflect the current state of
>> the code. Downgraded some BUG_ONs to WARN_ONs. Used the correct array
>> memory allocator function (kmalloc_array instead of kmalloc).
>> Corrected the format of some comments. Wrapped some lines differently
>> to keep the style checker happy.
>>
>> Fixed a WARN_ON when killing nodes. The dependency removal code checks
>> that nodes being destroyed do not have any oustanding dependencies
>> (which would imply they should not have been executed yet). In the
>> case of nodes being destroyed, e.g. due to context banning, then this
>> might well be the case - they have not been executed and do indeed
>> have outstanding dependencies.
>>
>> Re-instated the code to disble interrupts when not in use. The
>> underlying problem causing broken IRQ reference counts seems to have
>> been fixed now.
>>
>> v5: Shuffled various functions around to remove forward declarations
>> as apparently these are frowned upon. Removed lots of white space as
>> apparently having easy to read code is also frowned upon. Split the
>> direct submission scheduler bypass code out into a separate function.
>> Squashed down the i915_scheduler.c sections of various patches into
>> this patch. Thus the later patches simply hook in existing code into
>> various parts of the driver rather than adding the code as well. Added
>> documentation to various functions. Re-worked the submit function in
>> terms of mutex locking, error handling and exit paths. Split the
>> delayed work handler function in half. Made use of the kernel 'clamp'
>> macro. [Joonas Lahtinen]
>>
>> Added runtime PM calls as these must be done at the top level before
>> acquiring the driver mutex lock. [Chris Wilson]
>>
>> Removed some obsolete debug code that had been forgotten about.
>>
>> Moved more clean up code into the 'i915_gem_scheduler_clean_node()'
>> function rather than replicating it in mutliple places.
>>
>> Used lighter weight spinlocks.
>>
>> For: VIZ-1587
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/Makefile         |   1 +
>>   drivers/gpu/drm/i915/i915_drv.h       |   6 +
>>   drivers/gpu/drm/i915/i915_gem.c       |   5 +
>>   drivers/gpu/drm/i915/i915_scheduler.c | 874 ++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_scheduler.h |  95 ++++
>>   5 files changed, 981 insertions(+)
>>   create mode 100644 drivers/gpu/drm/i915/i915_scheduler.c
>>   create mode 100644 drivers/gpu/drm/i915/i915_scheduler.h
>>
>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
>> index 15398c5..79cb38b 100644
>> --- a/drivers/gpu/drm/i915/Makefile
>> +++ b/drivers/gpu/drm/i915/Makefile
>> @@ -10,6 +10,7 @@ ccflags-y := -Werror
>>   i915-y := i915_drv.o \
>>   	  i915_irq.o \
>>   	  i915_params.o \
>> +	  i915_scheduler.o \
>>             i915_suspend.o \
>>   	  i915_sysfs.o \
>>   	  intel_csr.o \
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index f4487b9..03add1a 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1703,6 +1703,8 @@ struct i915_execbuffer_params {
>>   	struct drm_i915_gem_request     *request;
>>   };
>>   
>> +struct i915_scheduler;
>> +
>>   /* used in computing the new watermarks state */
>>   struct intel_wm_config {
>>   	unsigned int num_pipes_active;
>> @@ -1968,6 +1970,8 @@ struct drm_i915_private {
>>   
>>   	struct i915_runtime_pm pm;
>>   
>> +	struct i915_scheduler *scheduler;
>> +
> I think we should have i915.enable_scheduler parameter (just like
> i915.enable_execlists) and make this a data member, not pointer.
>
> Then we can go forward and have i915.enable_scheduler_preempt and so on
> to control things at boot time.
There are. The enable_scheduler comes in as patch #21. Including it from 
the start means modifying i915_params.[ch] in this patch as well. And 
then you are getting into one big patch that contains everything and is 
unreviewable due to its sheer size. The enable_preemption parameter 
comes in with the pre-emption code. No point in adding that until there 
is something for it to control.

>
>>   	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
>>   	struct {
>>   		int (*execbuf_submit)(struct i915_execbuffer_params *params,
>> @@ -2290,6 +2294,8 @@ struct drm_i915_gem_request {
>>   	/** process identifier submitting this request */
>>   	struct pid *pid;
>>   
>> +	struct i915_scheduler_queue_entry	*scheduler_qe;
>> +
> Funny whitespace.
>
>>   	/**
>>   	 * The ELSP only accepts two elements at a time, so we queue
>>   	 * context/tail pairs on a given queue (ring->execlist_queue) until the
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index dfe43ea..7d9aa24 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -32,6 +32,7 @@
>>   #include "i915_vgpu.h"
>>   #include "i915_trace.h"
>>   #include "intel_drv.h"
>> +#include "i915_scheduler.h"
>>   #include
>>   #include
>>   #include
>> @@ -5315,6 +5316,10 @@ int i915_gem_init(struct drm_device *dev)
>>   	 */
>>   	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>   
>> +	ret = i915_scheduler_init(dev);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>>   	ret = i915_gem_init_userptr(dev);
>>   	if (ret)
>>   		goto out_unlock;
>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
>> new file mode 100644
>> index 0000000..fc23ee7
>> --- /dev/null
>> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
>> @@ -0,0 +1,874 @@
>> +/*
>> + * Copyright (c) 2014 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>> + * IN THE SOFTWARE.
>> + *
>> + */
>> +
>> +#include "i915_drv.h"
>> +#include "intel_drv.h"
>> +#include "i915_scheduler.h"
>> +
>> +/**
>> + * i915_scheduler_is_enabled - Returns true if the scheduler is enabled.
>> + * @dev: DRM device
>> + */
>> +bool i915_scheduler_is_enabled(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +
>> +	return dev_priv->scheduler != NULL;
>> +}
> i915.enable_scheduler would be used instead.
It is when it arrives.

>
>> +
>> +/**
>> + * i915_scheduler_init - Initialise the scheduler.
>> + * @dev: DRM device
>> + * Returns zero on success or -ENOMEM if memory allocations fail.
>> + */
>> +int i915_scheduler_init(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	int r;
>> +
> This protection -->
>
>> +	if (scheduler)
>> +		return 0;
>> +
>> +	scheduler = kzalloc(sizeof(*scheduler), GFP_KERNEL);
>> +	if (!scheduler)
>> +		return -ENOMEM;
>> +
> Is not needed if we have data member and i915.enable_scheduler.
Making the scheduler structure a object rather than a pointer basically 
means moving the whole of i915_scheduler.h into i915_drv.h. There is too 
much interconnectedness between the various structures. I would say that 
keeping the code nicely partitioned and readable is worth the one 
dynamic allocation at driver load time.

>
>> +	spin_lock_init(&scheduler->lock);
>> +
>> +	for (r = 0; r < I915_NUM_RINGS; r++)
>> +		INIT_LIST_HEAD(&scheduler->node_queue[r]);
>> +
>> +	/* Default tuning values: */
>> +	scheduler->priority_level_min     = -1023;
>> +	scheduler->priority_level_max     = 1023;
>> +	scheduler->priority_level_bump    = 50;
>> +	scheduler->priority_level_preempt = 900;
>> +	scheduler->min_flying             = 2;
>> +
>> +	dev_priv->scheduler = scheduler;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Add a popped node back in to the queue. For example, because the ring was
>> + * hung when execfinal() was called and thus the ring submission needs to be
>> + * retried later.
>> + */
>> +static void i915_scheduler_node_requeue(struct i915_scheduler_queue_entry *node)
>> +{
>> +	WARN_ON(!node);
> Rather remove this completely as it is an internal function, or use
> BUG_ON because next line will straight lead to OOPS after warning. I'll
> pass mentioning rest of the obvious WARN_ON vs BUG_ON's.
Yeah, most of the WARN/BUG_ONs are probably unnecessary now. They were 
scattered liberally around during initial development when strange and 
unexpected things could happen. Now the the code paths are pretty well 
established they aren't really much use. I'll reduce them down to just 
the ones that still make sense.

>
>> +	WARN_ON(!I915_SQS_IS_FLYING(node));
>> +
>> +	/* Seqno will be reassigned on relaunch */
>> +	node->params.request->seqno = 0;
>> +	node->status = i915_sqs_queued;
>> +}
>> +
>> +/*
>> + * Give up on a node completely. For example, because it is causing the
>> + * ring to hang or is using some resource that no longer exists.
>> + */
>> +static void i915_scheduler_node_kill(struct i915_scheduler_queue_entry *node)
>> +{
>> +	WARN_ON(!node);
>> +	WARN_ON(I915_SQS_IS_COMPLETE(node));
>> +
>> +	node->status = i915_sqs_dead;
>> +}
>> +
>> +/* Mark a node as in flight on the hardware. */
> As for documentation, from below, I only assume scheduler lock must be
> held and node's can be only manipulated when scheduler lock is held?
> It's good to add a comment describing the expected locking. Or maybe
> add assert_scheduler_lock_held() helper and sprinkle it around the
> functions similar to assert_rpm_wakelock_held(), which is self-
> documenting?
>
>> +static int i915_scheduler_node_fly(struct i915_scheduler_queue_entry *node)
>> +{
>> +	struct drm_i915_private *dev_priv = node->params.dev->dev_private;
> If node is NULL, this would already OOPS, I do not think it's
> unreasonable to expect node not to be NULL. The below WARN_ON check is
> never reached.
>
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct intel_engine_cs *ring;
>> +
>> +	WARN_ON(!scheduler);
>> +	WARN_ON(!node);
>> +	WARN_ON(node->status != i915_sqs_popped);
>> +
>> +	ring = node->params.ring;
>> +
> This and similar assignments can be squashed to declaration line, as
> this is basically an alias.
>
>> +	/*
>> +	 * Add the node (which should currently be in state popped) to the
>> +	 * front of the queue. This ensure that flying nodes are always held
>> +	 * in hardware submission order.
>> +	 */
>> +	list_add(&node->link, &scheduler->node_queue[ring->id]);
>> +
>> +	node->status = i915_sqs_flying;
>> +
>> +	if (!(scheduler->flags[ring->id] & i915_sf_interrupts_enabled)) {
>> +		bool success = true;
>> +
>> +		success = ring->irq_get(ring);
>> +		if (success)
>> +			scheduler->flags[ring->id] |= i915_sf_interrupts_enabled;
>> +		else
>> +			return -EINVAL;
> Shouldn't the node be cleaned up from node_queue in case of an error?
Not really. The batch buffer will still be executed regardless of 
whether interrupts could be enabled or not. The return code is actually 
ignored so this should really be a void function, I guess. Without 
interrupts then completion might take a while to be noticed but stuff 
should still basically work.

>
> Also, I think above could be written much more compact form (because it
> looks like something where more logic will be added later). It makes it
> easier to write and visualize the error paths when reading if there are
> no nested if's.
>
> I won't mention about the error paths of functions below, I expect the
> following guidline to be adhered;
>
> 	if (scheduler->flags[ring->id] & I915_SF_INT_ENABLED)
> 		goto out;
>
> 	if (!ring->irq_get(ring))
> 		goto err_irq;
>
> 	if (!...)
> 		goto err_foo;
>
> 	scheduler->flags[ring->id] |= I915_SF_INT_ENABLED;
> out:
> 	return 0;
> err_foo:
> 	foobar();
> err_irq:
> 	list_remove()
> 	return -EINVAL;
>
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static uint32_t i915_scheduler_count_flying(struct i915_scheduler *scheduler,
>> +					    struct intel_engine_cs *ring)
>> +{
>> +	struct i915_scheduler_queue_entry *node;
>> +	uint32_t flying = 0;
>> +
>> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link)
> +1 for the for_each_scheduler_node(...)
>
>> +		if (I915_SQS_IS_FLYING(node))
>> +			flying++;
>> +
>> +	return flying;
>> +}
>> +
>> +static void i915_scheduler_priority_bump_clear(struct i915_scheduler *scheduler)
>> +{
>> +	struct i915_scheduler_queue_entry *node;
>> +	int i;
>> +
>> +	/*
>> +	 * Ensure circular dependencies don't cause problems and that a bump
>> +	 * by object usage only bumps each using buffer once:
>> +	 */
>> +	for (i = 0; i < I915_NUM_RINGS; i++) {
>> +		list_for_each_entry(node, &scheduler->node_queue[i], link)
>> +			node->bumped = false;
>> +	}
>> +}
>> +
>> +static int i915_scheduler_priority_bump(struct i915_scheduler *scheduler,
>> +				struct i915_scheduler_queue_entry *target,
> Is there specific reason why struct i915_scheduler_queue_entry are not
> referred to just as "node" but "qe" and here something else, do "node"
> and "qe" have a special semantic meaning?
Not sure what you mean. The use of 'node' is because almost everything 
an i915_scheduler_queue_entry object is being used it is as a node in a 
list that is being iterated over. The only 'qe' variable is in 
'i915_scheduler_queue_execbuffer[_bypass]' and there it is because the 
qe is being passed in from outside and is copied to a local 'node' that 
is then added to the list. I think the use of 'target' here is largely 
historical. The subsequent array search was originally a list search and 
hence had a 'node' that was used as the iterator. Thus the qe passed in 
was called something else to avoid conflicts and target seemed appropriate.

>> +				uint32_t bump)
>> +{
>> +	uint32_t new_priority;
>> +	int i, count;
>> +
>> +	if (target->priority >= scheduler->priority_level_max)
>> +		return 1;
> So if one node reaches maximum priority the dependencies are expected
> to be maxed out already?
Yes. Dependencies should always be of equal or higher priority to the 
dependee. When a new dependency is created, the priority of the dependee 
is added on to ensure that something of high priority can never be stuck 
waiting on something of low priority. Also, it means that a batch buffer 
that lots of other buffers are waiting on will be bumped to higher and 
higher priority the more waiters there are and thus is more likely to be 
executed sooner and free them all up.

>> +
>> +	if (target->bumped)
>> +		return 0;
>> +
>> +	new_priority = target->priority + bump;
>> +	if ((new_priority <= target->priority) ||
>> +	    (new_priority > scheduler->priority_level_max))
>> +		target->priority = scheduler->priority_level_max;
>> +	else
>> +		target->priority = new_priority;
>> +
>> +	count = 1;
>> +	target->bumped = true;
>> +
>> +	for (i = 0; i < target->num_deps; i++) {
>> +		if (!target->dep_list[i])
>> +			continue;
>> +
>> +		if (target->dep_list[i]->bumped)
>> +			continue;
>> +
>> +		count += i915_scheduler_priority_bump(scheduler,
>> +						      target->dep_list[i],
>> +						      bump);
>> +	}
>> +
>> +	return count;
>> +}
>> +
>> +/*
>> + * Nodes are considered valid dependencies if they are queued on any ring or
>> + * if they are in flight on a different ring. In flight on the same ring is no
>> + * longer interesting for non-premptive nodes as the ring serialises execution.
>> + * For pre-empting nodes, all in flight dependencies are valid as they must not
>> + * be jumped by the act of pre-empting.
>> + *
>> + * Anything that is neither queued nor flying is uninteresting.
>> + */
>> +static inline bool i915_scheduler_is_dependency_valid(
>> +			struct i915_scheduler_queue_entry *node, uint32_t idx)
>> +{
>> +	struct i915_scheduler_queue_entry *dep;
>> +
>> +	dep = node->dep_list[idx];
>> +	if (!dep)
>> +		return false;
>> +
>> +	if (I915_SQS_IS_QUEUED(dep))
>> +		return true;
>> +
>> +	if (I915_SQS_IS_FLYING(dep)) {
>> +		if (node->params.ring != dep->params.ring)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static int i915_scheduler_pop_from_queue_locked(struct intel_engine_cs *ring,
>> +				struct i915_scheduler_queue_entry **pop_node)
>> +{
>> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct i915_scheduler_queue_entry *best = NULL;
>> +	struct i915_scheduler_queue_entry *node;
>> +	int ret;
>> +	int i;
>> +	bool any_queued = false;
>> +	bool has_local, has_remote, only_remote = false;
>> +
>> +	*pop_node = NULL;
>> +	ret = -ENODATA;
>> +
>> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
>> +		if (!I915_SQS_IS_QUEUED(node))
>> +			continue;
>> +		any_queued = true;
>> +
>> +		has_local  = false;
>> +		has_remote = false;
>> +		for (i = 0; i < node->num_deps; i++) {
>> +			if (!i915_scheduler_is_dependency_valid(node, i))
>> +				continue;
>> +
>> +			if (node->dep_list[i]->params.ring == node->params.ring)
>> +				has_local = true;
>> +			else
>> +				has_remote = true;
>> +		}
>> +
>> +		if (has_remote && !has_local)
>> +			only_remote = true;
>> +
>> +		if (!has_local && !has_remote) {
>> +			if (!best ||
>> +			    (node->priority > best->priority))
>> +				best = node;
>> +		}
>> +	}
>> +
>> +	if (best) {
>> +		list_del(&best->link);
>> +
>> +		INIT_LIST_HEAD(&best->link);
>> +		best->status  = i915_sqs_popped;
>> +
>> +		ret = 0;
>> +	} else {
>> +		/* Can only get here if:
>> +		 * (a) there are no buffers in the queue
>> +		 * (b) all queued buffers are dependent on other buffers
>> +		 *     e.g. on a buffer that is in flight on a different ring
>> +		 */
>> +		if (only_remote) {
>> +			/* The only dependent buffers are on another ring. */
>> +			ret = -EAGAIN;
>> +		} else if (any_queued) {
>> +			/* It seems that something has gone horribly wrong! */
>> +			DRM_ERROR("Broken dependency tracking on ring %d!\n",
>> +				  (int) ring->id);
> Would this condition be worthy WARN_ONCE(), because it will keep
> repeating as the situation will persist?
>
>> +		}
>> +	}
>> +
>> +	*pop_node = best;
>> +	return ret;
>> +}
>> +
>> +/*
>> + * NB: The driver mutex lock must be held before calling this function. It is
>> + * only really required during the actual back end submission call. However,
>> + * attempting to acquire a mutex while holding a spin lock is a Bad Idea.
>> + * And releasing the one before acquiring the other leads to other code
>> + * being run and interfering.
>> + */
>> +static int i915_scheduler_submit(struct intel_engine_cs *ring)
>> +{
>> +	struct drm_device *dev = ring->dev;
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct i915_scheduler_queue_entry *node;
>> +	int ret, count = 0, flying;
>> +
>> +	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
>> +
>> +	spin_lock_irq(&scheduler->lock);
>> +
>> +	WARN_ON(scheduler->flags[ring->id] & i915_sf_submitting);
>> +	scheduler->flags[ring->id] |= i915_sf_submitting;
>> +
>> +	/* First time around, complain if anything unexpected occurs: */
>> +	ret = i915_scheduler_pop_from_queue_locked(ring, &node);
>> +	if (ret)
>> +		goto error;
>> +
>> +	do {
>> +		WARN_ON(!node);
>> +		WARN_ON(node->params.ring != ring);
>> +		WARN_ON(node->status != i915_sqs_popped);
>> +		count++;
>> +
>> +		/*
>> +		 * The call to pop above will have removed the node from the
>> +		 * list. So add it back in and mark it as in flight.
>> +		 */
>> +		i915_scheduler_node_fly(node);
>> +
>> +		spin_unlock_irq(&scheduler->lock);
>> +		ret = dev_priv->gt.execbuf_final(&node->params);
>> +		spin_lock_irq(&scheduler->lock);
>> +
>> +		/*
>> +		 * Handle failed submission but first check that the
>> +		 * watchdog/reset code has not nuked the node while we
>> +		 * weren't looking:
>> +		 */
>> +		if (ret && (node->status != i915_sqs_dead)) {
>> +			bool requeue = true;
>> +
>> +			/*
>> +			 * Oh dear! Either the node is broken or the ring is
>> +			 * busy. So need to kill the node or requeue it and try
>> +			 * again later as appropriate.
>> +			 */
>> +
>> +			switch (-ret) {
>> +			case ENODEV:
>> +			case ENOENT:
>> +				/* Fatal errors. Kill the node. */
>> +				requeue = false;
>> +				i915_scheduler_node_kill(node);
>> +			break;
> These break indents still.
>
>> +
>> +			case EAGAIN:
>> +			case EBUSY:
>> +			case EIO:
>> +			case ENOMEM:
>> +			case ERESTARTSYS:
>> +			case EINTR:
>> +				/* Supposedly recoverable errors. */
>> +			break;
>> +
>> +			default:
>> +				/*
>> +				 * Assume the error is recoverable and hope
>> +				 * for the best.
>> +				 */
>> +				MISSING_CASE(-ret);
>> +			break;
>> +			}
>> +
>> +			if (requeue) {
>> +				i915_scheduler_node_requeue(node);
>> +				/*
>> +				 * No point spinning if the ring is currently
>> +				 * unavailable so just give up and come back
>> +				 * later.
>> +				 */
>> +				break;
>> +			}
>> +		}
>> +
>> +		/* Keep launching until the sky is sufficiently full. */
>> +		flying = i915_scheduler_count_flying(scheduler, ring);
>> +		if (flying >= scheduler->min_flying)
>> +			break;
>> +
>> +		/* Grab another node and go round again... */
>> +		ret = i915_scheduler_pop_from_queue_locked(ring, &node);
>> +	} while (ret == 0);
>> +
>> +	/* Don't complain about not being able to submit extra entries */
>> +	if (ret == -ENODATA)
>> +		ret = 0;
>> +
>> +	/*
>> +	 * Bump the priority of everything that was not submitted to prevent
>> +	 * starvation of low priority tasks by a spamming high priority task.
>> +	 */
> This, this calls for an I-G-T test to make sure we're effective.
Yeah, IGTs are under development.

>> +	i915_scheduler_priority_bump_clear(scheduler);
>> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
>> +		if (!I915_SQS_IS_QUEUED(node))
>> +			continue;
>> +
>> +		i915_scheduler_priority_bump(scheduler, node,
>> +					     scheduler->priority_level_bump);
>> +	}
>> +
>> +	/* On success, return the number of buffers submitted. */
>> +	if (ret == 0)
>> +		ret = count;
>> +
>> +error:
>> +	scheduler->flags[ring->id] &= ~i915_sf_submitting;
>> +	spin_unlock_irq(&scheduler->lock);
>> +	return ret;
>> +}
>> +
>> +static void i915_generate_dependencies(struct i915_scheduler *scheduler,
>> +				       struct i915_scheduler_queue_entry *node,
>> +				       uint32_t ring)
>> +{
>> +	struct i915_scheduler_obj_entry *this, *that;
>> +	struct i915_scheduler_queue_entry *test;
>> +	int i, j;
>> +	bool found;
>> +
>> +	list_for_each_entry(test, &scheduler->node_queue[ring], link) {
>> +		if (I915_SQS_IS_COMPLETE(test))
>> +			continue;
>> +
>> +		/*
>> +		 * Batches on the same ring for the same
>> +		 * context must be kept in order.
>> +		 */
>> +		found = (node->params.ctx == test->params.ctx) &&
>> +			(node->params.ring == test->params.ring);
>> +
>> +		/*
>> +		 * Batches working on the same objects must
>> +		 * be kept in order.
>> +		 */
>> +		for (i = 0; (i < node->num_objs) && !found; i++) {
>> +			this = node->saved_objects + i;
> node->objs and node->num_objs would be a better naming match. num_objs
> and saved_objects makes me think we have a bug here, indexing wrong
> array.
>
>> +
>> +			for (j = 0; j < test->num_objs; j++) {
>> +				that = test->saved_objects + j;
>> +
>> +				if (this->obj != that->obj)
>> +					continue;
>> +
>> +				/* Only need to worry about writes */
>> +				if (this->read_only && that->read_only)
>> +					continue;
>> +
>> +				found = true;
>> +				break;
>> +			}
>> +		}
>> +
>> +		if (found) {
>> +			node->dep_list[node->num_deps] = test;
>> +			node->num_deps++;
>> +		}
>> +	}
>> +}
>> +
>> +static int i915_scheduler_queue_execbuffer_bypass(struct i915_scheduler_queue_entry *qe)
>> +{
>> +	struct drm_i915_private *dev_priv = qe->params.dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	int ret;
>> +
>> +	scheduler->flags[qe->params.ring->id] |= i915_sf_submitting;
>> +	ret = dev_priv->gt.execbuf_final(&qe->params);
>> +	scheduler->flags[qe->params.ring->id] &= ~i915_sf_submitting;
>> +
> The above code makes me think of locking again, maybe document at top
> of this function.
No need to acquire the spinlock at this point. If the scheduler is in 
bypass mode then there is no internal state to protect. The driver mutex 
lock will be held because there is only one code path to get here which 
is the execbuff IOCTL. The whole point of bypass mode is that it is a 
single contiguous execution path from IOCTL entry all the way through to 
hardware submission. A mutex check could be added but it seems 
unnecessary given the limited calling scope.

>
>> +	/*
>> +	 * Don't do any clean up on failure because the caller will
>> +	 * do it all anyway.
>> +	 */
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* Need to release any resources held by the node: */
>> +	i915_scheduler_clean_node(qe);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * i915_scheduler_queue_execbuffer - Submit a batch buffer request to the
>> + * scheduler.
>> + * @qe: The batch buffer request to be queued.
>> + * The expectation is the qe passed in is a local stack variable. This
>> + * function will copy its contents into a freshly allocated list node. The
>> + * new node takes ownership of said contents so the original qe should simply
>> + * be discarded and not cleaned up (i.e. don't free memory it points to or
>> + * dereference objects it holds). The node is added to the scheduler's queue
>> + * and the batch buffer will be submitted to the hardware at some future
>> + * point in time (which may be immediately, before returning or may be quite
>> + * a lot later).
>> + */
>> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe)
>> +{
>> +	struct drm_i915_private *dev_priv = qe->params.dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct intel_engine_cs *ring = qe->params.ring;
>> +	struct i915_scheduler_queue_entry *node;
>> +	struct i915_scheduler_queue_entry *test;
>> +	bool not_flying;
>> +	int i, r;
>> +	int incomplete = 0;
>> +
>> +	WARN_ON(!scheduler);
>> +
>> +	if (1/*!i915.enable_scheduler*/)
>> +		return i915_scheduler_queue_execbuffer_bypass(qe);
>> +
> I'm thinking, should we branch already before calling a function named
> i915_scheduler_queue_execbuffer if scheduler is disabled?
That would require putting if(scheduler) logic in random points 
throughout the driver. Doing it this way, the driver as a whole has a 
single execution path which is via the scheduler. It is then up to the 
scheduler itself to decide whether to run in bypass mode, simple 
re-ordering mode, full pre-emption mode or any other mode we feel a need 
to add.

>> +	node = kmalloc(sizeof(*node), GFP_KERNEL);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	*node = *qe;
> Now I read the added documentation for the function, maybe we should at
> least zero out qe to avoid future problems?
The execbuff success path hands the qe contents over and pretty much 
immediately drops all the way back out of the IOCTL. The failure path 
does the full cleanup of de-allocating all the qe stuff it had 
previously allocated. IMO there doesn't seem to be much need to zero out 
the structure at this point.

>
>> +	INIT_LIST_HEAD(&node->link);
>> +	node->status = i915_sqs_queued;
>> +	node->stamp  = jiffies;
>> +	i915_gem_request_reference(node->params.request);
>> +
>> +	WARN_ON(node->params.request->scheduler_qe);
>> +	node->params.request->scheduler_qe = node;
>> +
>> +	/* Need to determine the number of incomplete entries in the list as
>> +	 * that will be the maximum size of the dependency list.
>> +	 *
>> +	 * Note that the allocation must not be made with the spinlock acquired
>> +	 * as kmalloc can sleep. However, the unlock/relock is safe because no
>> +	 * new entries can be queued up during the unlock as the i915 driver
>> +	 * mutex is still held. Entries could be removed from the list but that
>> +	 * just means the dep_list will be over-allocated which is fine.
>> +	 */
>> +	spin_lock_irq(&scheduler->lock);
> -->
>
>> +	for (r = 0; r < I915_NUM_RINGS; r++) {
>> +		list_for_each_entry(test, &scheduler->node_queue[r], link) {
>> +			if (I915_SQS_IS_COMPLETE(test))
>> +				continue;
>> +
>> +			incomplete++;
>> +		}
>> +	}
>> +
> <-- This counting structure, I still think it would be good idea to
> make it a static helper.
>
>> +	/* Temporarily unlock to allocate memory: */
>> +	spin_unlock_irq(&scheduler->lock);
>> +	if (incomplete) {
>> +		node->dep_list = kmalloc_array(incomplete,
>> +					       sizeof(*node->dep_list),
>> +					       GFP_KERNEL);
>> +		if (!node->dep_list) {
>> +			kfree(node);
>> +			return -ENOMEM;
>> +		}
>> +	} else
>> +		node->dep_list = NULL;
>> +
>> +	spin_lock_irq(&scheduler->lock);
>> +	node->num_deps = 0;
>> +
>> +	if (node->dep_list) {
>> +		for (r = 0; r < I915_NUM_RINGS; r++)
>> +			i915_generate_dependencies(scheduler, node, r);
>> +
>> +		WARN_ON(node->num_deps > incomplete);
>> +	}
>> +
>> +	node->priority = clamp(node->priority,
>> +			       scheduler->priority_level_min,
>> +			       scheduler->priority_level_max);
>> +
>> +	if ((node->priority > 0) && node->num_deps) {
>> +		i915_scheduler_priority_bump_clear(scheduler);
>> +
>> +		for (i = 0; i < node->num_deps; i++)
>> +			i915_scheduler_priority_bump(scheduler,
>> +					node->dep_list[i], node->priority);
> If node is posted with maximum priority to begin with, wouldn't the
> priority propagation stop after first level dependencies due to the
> check in the beginning of priority_bump function?
The recursion will stop whenever it encounters a node already at maximum 
priority. However, it is not starting from the newly submitted node but 
from that node's dependencies. If they are already at max priority then 
their dependencies must also be and so on down the chain and no further 
processing is required. On the other hand, if they are not then they 
will be bumped and their dependencies checked and so on down the chain.

>> +
>> +	list_add_tail(&node->link, &scheduler->node_queue[ring->id]);
>> +
>> +	not_flying = i915_scheduler_count_flying(scheduler, ring) <
>> +						 scheduler->min_flying;
>> +
>> +	spin_unlock_irq(&scheduler->lock);
>> +
>> +	if (not_flying)
>> +		i915_scheduler_submit(ring);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * i915_scheduler_notify_request - Notify the scheduler that the given
>> + * request has completed on the hardware.
>> + * @req: Request structure which has completed
>> + * @preempt: Did it complete pre-emptively?
>> + * A sequence number has popped out of the hardware and the request handling
>> + * code has mapped it back to a request and will mark that request complete.
>> + * It also calls this function to notify the scheduler about the completion
>> + * so the scheduler's node can be updated appropriately.
>> + * Returns true if the request is scheduler managed, false if not. The return
>> + * value is combined for all freshly completed requests and if any were true
>> + * then i915_scheduler_wakeup() is called so the scheduler can do further
>> + * processing (submit more work) at the end.
>> + */
>> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req)
>> +{
>> +	struct drm_i915_private *dev_priv = to_i915(req->ring->dev);
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct i915_scheduler_queue_entry *node = req->scheduler_qe;
>> +	unsigned long flags;
>> +
>> +	if (!node)
>> +		return false;
>> +
>> +	spin_lock_irqsave(&scheduler->lock, flags);
>> +
>> +	WARN_ON(!I915_SQS_IS_FLYING(node));
>> +
>> +	/* Node was in flight so mark it as complete. */
>> +	if (req->cancelled)
>> +		node->status = i915_sqs_dead;
>> +	else
>> +		node->status = i915_sqs_complete;
>> +
>> +	spin_unlock_irqrestore(&scheduler->lock, flags);
>> +
>> +	return true;
>> +}
>> +
>> +static int i915_scheduler_remove_dependent(struct i915_scheduler *scheduler,
>> +				struct i915_scheduler_queue_entry *remove)
>> +{
>> +	struct i915_scheduler_queue_entry *node;
>> +	int i, r;
>> +	int count = 0;
>> +
>> +	/*
>> +	 * Ensure that a node is not being removed which is still dependent
>> +	 * upon other (not completed) work. If that happens, it implies
>> +	 * something has gone very wrong with the dependency tracking! Note
>> +	 * that there is no need to worry if this node has been explicitly
>> +	 * killed for some reason - it might be being killed before it got
>> +	 * sent to the hardware.
>> +	 */
>> +	if (remove->status != i915_sqs_dead) {
>> +		for (i = 0; i < remove->num_deps; i++)
>> +			if ((remove->dep_list[i]) &&
>> +			    (!I915_SQS_IS_COMPLETE(remove->dep_list[i])))
>> +				count++;
>> +		WARN_ON(count);
>> +	}
>> +
>> +	/*
>> +	 * Remove this node from the dependency lists of any other node which
>> +	 * might be waiting on it.
>> +	 */
>> +	for (r = 0; r < I915_NUM_RINGS; r++) {
>> +		list_for_each_entry(node, &scheduler->node_queue[r], link) {
>> +			for (i = 0; i < node->num_deps; i++) {
>> +				if (node->dep_list[i] != remove)
>> +					continue;
>> +
>> +				node->dep_list[i] = NULL;
>> +			}
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * i915_scheduler_wakeup - wake the scheduler's worker thread
>> + * @dev: DRM device
>> + * Called at the end of seqno interrupt processing if any request has
>> + * completed that corresponds to a scheduler node.
>> + */
>> +void i915_scheduler_wakeup(struct drm_device *dev)
>> +{
>> +	/* XXX: Need to call i915_scheduler_remove() via work handler. */
>> +}
>> +
>> +/**
>> + * i915_scheduler_clean_node - free up any allocations/references
>> + * associated with the given scheduler queue entry.
>> + * @node: Queue entry structure which is complete
>> + * After a give batch buffer completes on the hardware, all the information
>> + * required to resubmit it is no longer required. However, the node entry
>> + * itself might still be required for tracking purposes for a while longer.
>> + * This function should be called as soon as the node is known to be complete
>> + * so that these resources may be freed even though the node itself might
>> + * hang around.
>> + */
>> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node)
>> +{
>> +	if (!I915_SQS_IS_COMPLETE(node)) {
>> +		WARN(!node->params.request->cancelled,
>> +		     "Cleaning active node: %d!\n", node->status);
>> +		return;
>> +	}
>> +
>> +	if (node->params.batch_obj) {
>> +		/*
>> +		 * The batch buffer must be unpinned before it is unreferenced
>> +		 * otherwise the unpin fails with a missing vma!?
>> +		 */
>> +		if (node->params.dispatch_flags & I915_DISPATCH_SECURE)
>> +			i915_gem_execbuff_release_batch_obj(node->params.batch_obj);
>> +
> This code seems little bit out of place.
Not sure where else you could put it. The code itself is just a copy of 
the existing clean up code that was already in the execbuff code path. 
The cleanup can no longer be done within the execbuff call due to the 
batch buffer execution being arbitrarily delayed by the scheduler. Thus 
it is down to the scheduler to do all the necessary cleanup when it 
knows that the batch buffer is no longer required.

>
>> +		node->params.batch_obj = NULL;
>> +	}
>> +
>> +	/* And anything else owned by the node: */
>> +	if (node->params.cliprects) {
>> +		kfree(node->params.cliprects);
>> +		node->params.cliprects = NULL;
>> +	}
>> +}
>> +
> The below function is quite hacky, I understood this will be mitigated
> by per-ctx sequence numbering in the future?
To be honest, I think it is already obsolete due to the previous work of 
converting from seqno tests everywhere to testing request structures. 
The point of the code below was to cope with seqno values popping out of 
the hardware in random order due to the re-ordering of batch buffers 
prior to execution. Whereas, the struct request conversion allowed the 
seqno to be late allocated and thus kept in order.

Unfortunately, pre-emption confuses matters again. Although the 
intention is that seqno values will still be kept in order. It's just 
that a given request might have multiple different seqno values assigned 
to it throughout its life. The pre-emption code hasn't quite settled 
down yet though, especially with interesting new hardware planned for 
future chips. Thus I don't really want to rip out all of this code quite 
yet just in case we do still need it.

As you say, per-ctx seqnos would definitely have an effect here as well. 
So yes, long term the intention is to remove all this nastyness. Short 
term, it seems best to keep it around until proven unnecessary.

>
>> +static bool i915_scheduler_remove(struct i915_scheduler *scheduler,
>> +				  struct intel_engine_cs *ring,
>> +				  struct list_head *remove)
>> +{
>> +	struct i915_scheduler_queue_entry *node, *node_next;
>> +	int flying = 0, queued = 0;
>> +	bool do_submit;
>> +	uint32_t min_seqno;
>> +
>> +	spin_lock_irq(&scheduler->lock);
>> +
>> +	/*
>> +	 * In the case where the system is idle, starting 'min_seqno' from a big
>> +	 * number will cause all nodes to be removed as they are now back to
>> +	 * being in-order. However, this will be a problem if the last one to
>> +	 * complete was actually out-of-order as the ring seqno value will be
>> +	 * lower than one or more completed buffers. Thus code looking for the
>> +	 * completion of said buffers will wait forever.
>> +	 * Instead, use the hardware seqno as the starting point. This means
>> +	 * that some buffers might be kept around even in a completely idle
>> +	 * system but it should guarantee that no-one ever gets confused when
>> +	 * waiting for buffer completion.
>> +	 */
>> +	min_seqno = ring->get_seqno(ring, true);
>> +
>> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
>> +		if (I915_SQS_IS_QUEUED(node))
>> +			queued++;
>> +		else if (I915_SQS_IS_FLYING(node))
>> +			flying++;
>> +		else if (I915_SQS_IS_COMPLETE(node))
>> +			continue;
>> +
>> +		if (node->params.request->seqno == 0)
>> +			continue;
>> +
>> +		if (!i915_seqno_passed(node->params.request->seqno, min_seqno))
>> +			min_seqno = node->params.request->seqno;
>> +	}
>> +
>> +	INIT_LIST_HEAD(remove);
>> +	list_for_each_entry_safe(node, node_next, &scheduler->node_queue[ring->id], link) {
>> +		/*
>> +		 * Only remove completed nodes which have a lower seqno than
>> +		 * all pending nodes. While there is the possibility of the
>> +		 * ring's seqno counting backwards, all higher buffers must
>> +		 * be remembered so that the 'i915_seqno_passed()' test can
>> +		 * report that they have in fact passed.
>> +		 *
>> +		 * NB: This is not true for 'dead' nodes. The GPU reset causes
>> +		 * the software seqno to restart from its initial value. Thus
>> +		 * the dead nodes must be removed even though their seqno values
>> +		 * are potentially vastly greater than the current ring seqno.
>> +		 */
>> +		if (!I915_SQS_IS_COMPLETE(node))
>> +			continue;
>> +
>> +		if (node->status != i915_sqs_dead) {
>> +			if (i915_seqno_passed(node->params.request->seqno, min_seqno) &&
>> +			    (node->params.request->seqno != min_seqno))
>> +				continue;
>> +		}
>> +
>> +		list_del(&node->link);
>> +		list_add(&node->link, remove);
>> +
>> +		/* Strip the dependency info while the mutex is still locked */
>> +		i915_scheduler_remove_dependent(scheduler, node);
>> +
>> +		continue;
>> +	}
>> +
>> +	/*
>> +	 * Release the interrupt reference count if there are no longer any
>> +	 * nodes to worry about.
>> +	 */
>> +	if (!flying && !queued &&
>> +	    (scheduler->flags[ring->id] & i915_sf_interrupts_enabled)) {
>> +		ring->irq_put(ring);
>> +		scheduler->flags[ring->id] &= ~i915_sf_interrupts_enabled;
> Maybe have this piece of code as a helper? To enable interrupts and
> disable them, maybe even a reference count?
Not sure how much you could abstract out into a helper other than the 
two lines of irq_get|put + flag set|clear. The flying|queued test is 
particular to this one instance. Also, there is no need to reference 
count as the irq code does that already.

>
>> +	}
>> +
>> +	/* Launch more packets now? */
>> +	do_submit = (queued > 0) && (flying < scheduler->min_flying);
>> +
>> +	spin_unlock_irq(&scheduler->lock);
>> +
>> +	return do_submit;
>> +}
>> +
>> +void i915_scheduler_process_work(struct intel_engine_cs *ring)
>> +{
>> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
>> +	struct i915_scheduler_queue_entry *node;
>> +	bool do_submit;
>> +	struct list_head remove;
>> +
>> +	if (list_empty(&scheduler->node_queue[ring->id]))
>> +		return;
>> +
>> +	/* Remove completed nodes. */
>> +	do_submit = i915_scheduler_remove(scheduler, ring, &remove);
>> +
>> +	if (!do_submit && list_empty(&remove))
>> +		return;
>> +
>> +	/* Need to grab the pm lock outside of the mutex lock */
>> +	if (do_submit)
>> +		intel_runtime_pm_get(dev_priv);
>> +
>> +	mutex_lock(&ring->dev->struct_mutex);
>> +
>> +	if (do_submit)
>> +		i915_scheduler_submit(ring);
>> +
>> +	while (!list_empty(&remove)) {
>> +		node = list_first_entry(&remove, typeof(*node), link);
>> +		list_del(&node->link);
>> +
>> +		/* Free up all the DRM references */
>> +		i915_scheduler_clean_node(node);
>> +
>> +		/* And anything else owned by the node: */
>> +		node->params.request->scheduler_qe = NULL;
>> +		i915_gem_request_unreference(node->params.request);
>> +		kfree(node->dep_list);
>> +		kfree(node);
> Maybe have a separate helper for freeing a node, would make this much
> cleaner.
There is a helper for freeing up the contents of the node - see 
i915_scheduler_clean_node() above. However, that function is called from 
multiple places and this is the only instance where the node itself is 
actually freed. The other callers either need to keep the node around 
for a while longer (i.e. until this piece of code can run) or don't own 
the node because it is stack allocated (in the bypass case). So the 
'anything else' section cannot be moved into the existing clean function 
and any secondary helper would be single usage and only four lines long. 
It doesn't seem worth it.

>
>> +	}
>> +
>> +	mutex_unlock(&ring->dev->struct_mutex);
>> +
>> +	if (do_submit)
>> +		intel_runtime_pm_put(dev_priv);
>> +}
>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
>> new file mode 100644
>> index 0000000..415fec8
>> --- /dev/null
>> +++ b/drivers/gpu/drm/i915/i915_scheduler.h
>> @@ -0,0 +1,95 @@
>> +/*
>> + * Copyright (c) 2014 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>> + * IN THE SOFTWARE.
>> + *
>> + */
>> +
>> +#ifndef _I915_SCHEDULER_H_
>> +#define _I915_SCHEDULER_H_
>> +
>> +enum i915_scheduler_queue_status {
>> +	/* Limbo: */
>> +	i915_sqs_none = 0,
>> +	/* Not yet submitted to hardware: */
>> +	i915_sqs_queued,
>> +	/* Popped from queue, ready to fly: */
>> +	i915_sqs_popped,
>> +	/* Sent to hardware for processing: */
>> +	i915_sqs_flying,
>> +	/* Finished processing on the hardware: */
>> +	i915_sqs_complete,
>> +	/* Killed by watchdog or catastrophic submission failure: */
>> +	i915_sqs_dead,
>> +	/* Limit value for use with arrays/loops */
>> +	i915_sqs_MAX
>> +};
>> +
> To uppercase.
Grrr. Linux kernel style guide is daft.

>
>> +#define I915_SQS_IS_QUEUED(node)	(((node)->status == i915_sqs_queued))
>> +#define I915_SQS_IS_FLYING(node)	(((node)->status == i915_sqs_flying))
>> +#define I915_SQS_IS_COMPLETE(node)	(((node)->status == i915_sqs_complete) || \
>> +					 ((node)->status == i915_sqs_dead))
>> +
>> +struct i915_scheduler_obj_entry {
>> +	struct drm_i915_gem_object          *obj;
>> +	bool                                read_only;
>> +};
>> +
>> +struct i915_scheduler_queue_entry {
>> +	struct i915_execbuffer_params       params;
>> +	/* -1023 = lowest priority, 0 = default, 1023 = highest */
>> +	int32_t                             priority;
>> +	struct i915_scheduler_obj_entry     *saved_objects;
>> +	int                                 num_objs;
>> +	bool                                bumped;
>> +	struct i915_scheduler_queue_entry   **dep_list;
>> +	int                                 num_deps;
>> +	enum i915_scheduler_queue_status    status;
>> +	unsigned long                       stamp;
>> +	struct list_head                    link;
> Above whitespace is still incorrect and I could really use comments.
>
>> +};
>> +
>> +struct i915_scheduler {
>> +	struct list_head    node_queue[I915_NUM_RINGS];
>> +	uint32_t            flags[I915_NUM_RINGS];
>> +	spinlock_t          lock;
>> +
>> +	/* Tuning parameters: */
>> +	int32_t             priority_level_min;
>> +	int32_t             priority_level_max;
>> +	int32_t             priority_level_bump;
>> +	int32_t             priority_level_preempt;
>> +	uint32_t            min_flying;
>> +};
>> +
>> +/* Flag bits for i915_scheduler::flags */
>> +enum {
>> +	i915_sf_interrupts_enabled  = (1 << 0),
>> +	i915_sf_submitting          = (1 << 1),
>> +};
> These must be I915_UPPERCASE_SOMETHING by convention.
>
>> +
>> +bool i915_scheduler_is_enabled(struct drm_device *dev);
>> +int i915_scheduler_init(struct drm_device *dev);
>> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node);
>> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe);
>> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req);
>> +void i915_scheduler_wakeup(struct drm_device *dev);
>> +
>> +#endif  /* _I915_SCHEDULER_H_ */