[Intel-gfx] [PATCH v6 06/34] drm/i915: Start of GPU scheduler
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Jun 10 16:24:08 UTC 2016
Hi,
Just a few random comments/questions. (not a full review!)
On 20/04/16 18:13, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> Initial creation of scheduler source files. Note that this patch
> implements most of the scheduler functionality but does not hook it in
> to the driver yet. It also leaves the scheduler code in 'pass through'
> mode so that even when it is hooked in, it will not actually do very
> much. This allows the hooks to be added one at a time in bite size
> chunks and only when the scheduler is finally enabled at the end does
> anything start happening.
>
> The general theory of operation is that when batch buffers are
> submitted to the driver, the execbuffer() code packages up all the
> information required to execute the batch buffer at a later time. This
> package is given over to the scheduler which adds it to an internal
> node list. The scheduler also scans the list of objects associated
> with the batch buffer and compares them against the objects already in
> use by other buffers in the node list. If matches are found then the
> new batch buffer node is marked as being dependent upon the matching
> node. The same is done for the context object. The scheduler also
> bumps up the priority of such matching nodes on the grounds that the
> more dependencies a given batch buffer has the more important it is
> likely to be.
>
> The scheduler aims to have a given (tuneable) number of batch buffers
> in flight on the hardware at any given time. If fewer than this are
> currently executing when a new node is queued, then the node is passed
> straight through to the submit function. Otherwise it is simply added
> to the queue and the driver returns back to user land.
>
> The scheduler is notified when each batch buffer completes and updates
> its internal tracking accordingly. At the end of the completion
> interrupt processing, if any scheduler tracked batches were processed,
> the scheduler's deferred worker thread is woken up. This can do more
> involved processing such as actually removing completed nodes from the
> queue and freeing up the resources associated with them (internal
> memory allocations, DRM object references, context reference, etc.).
> The work handler also checks the in flight count and calls the
> submission code if a new slot has appeared.
>
> When the scheduler's submit code is called, it scans the queued node
> list for the highest priority node that has no unmet dependencies.
> Note that the dependency calculation is complex as it must take
> inter-ring dependencies and potential preemptions into account. Note
> also that in the future this will be extended to include external
> dependencies such as the Android Native Sync file descriptors and/or
> the Linux dma-buf synchronisation scheme.
>
> If a suitable node is found then it is sent to execbuf_final() for
> submission to the hardware. The in flight count is then re-checked and
> a new node popped from the list if appropriate. All nodes that are not
> submitted have their priority bumped. This ensures that low priority
> tasks do not get starved out by busy higher priority ones - everything
> will eventually get its turn to run.
>
> Note that this patch does not implement pre-emptive scheduling. Only
> basic scheduling by re-ordering batch buffer submission is currently
> implemented. Pre-emption of actively executing batch buffers comes in
> the next patch series.
>
> v2: Changed priority levels to +/-1023 due to feedback from Chris
> Wilson.
>
> Removed redundant index from scheduler node.
>
> Changed time stamps to use jiffies instead of raw monotonic. This
> provides lower resolution but improved compatibility with other i915
> code.
>
> Major re-write of completion tracking code due to struct fence
> conversion. The scheduler no longer has its own private IRQ handler
> but just lets the existing request code handle completion events.
> Instead, the scheduler now hooks into the request notify code to be
> told when a request has completed.
>
> Reduced driver mutex locking scope. Removal of scheduler nodes no
> longer grabs the mutex lock.
>
> v3: Refactor of dependency generation to make the code more readable.
> Also added in read-read optimisation support - i.e., don't treat a
> shared read-only buffer as being a dependency.
>
> Allowed the killing of queued nodes rather than only flying ones.
>
> v4: Updated the commit message to better reflect the current state of
> the code. Downgraded some BUG_ONs to WARN_ONs. Used the correct array
> memory allocator function (kmalloc_array instead of kmalloc).
> Corrected the format of some comments. Wrapped some lines differently
> to keep the style checker happy.
>
> Fixed a WARN_ON when killing nodes. The dependency removal code checks
> that nodes being destroyed do not have any outstanding dependencies
> (which would imply they should not have been executed yet). In the
> case of nodes being destroyed, e.g. due to context banning, then this
> might well be the case - they have not been executed and do indeed
> have outstanding dependencies.
>
> Re-instated the code to disable interrupts when not in use. The
> underlying problem causing broken IRQ reference counts seems to have
> been fixed now.
>
> v5: Shuffled various functions around to remove forward declarations
> as apparently these are frowned upon. Removed lots of white space as
> apparently having easy to read code is also frowned upon. Split the
> direct submission scheduler bypass code out into a separate function.
> Squashed down the i915_scheduler.c sections of various patches into
> this patch. Thus the later patches simply hook in existing code into
> various parts of the driver rather than adding the code as well. Added
> documentation to various functions. Re-worked the submit function in
> terms of mutex locking, error handling and exit paths. Split the
> delayed work handler function in half. Made use of the kernel 'clamp'
> macro. [Joonas Lahtinen]
>
> Added runtime PM calls as these must be done at the top level before
> acquiring the driver mutex lock. [Chris Wilson]
>
> Removed some obsolete debug code that had been forgotten about.
>
> Moved more clean up code into the 'i915_gem_scheduler_clean_node()'
> function rather than replicating it in multiple places.
>
> Used lighter weight spinlocks.
>
> v6: Updated to newer nightly (lots of ring -> engine renaming).
>
> Added 'for_each_scheduler_node()' and 'assert_scheduler_lock_held()'
> helper macros. Renamed 'i915_gem_execbuff_release_batch_obj' to
> 'i915_gem_execbuf_release_batch_obj'. Updated to use 'to_i915()'
> instead of dev_private. Converted all enum labels to uppercase.
> Removed various unnecessary WARNs. Renamed 'saved_objects' to just
> 'objs'. Split code for counting incomplete nodes out into a separate
> function. Removed even more white space. Added a destroy() function.
> [review feedback from Joonas Lahtinen]
>
> Added running totals of 'flying' and 'queued' nodes rather than
> re-calculating each time as a minor CPU performance optimisation.
>
> Removed support for out of order seqno completion. All the prep work
> patch series (seqno to request conversion, late seqno assignment,
> etc.) that has now been done means that the scheduler no longer
> generates out of order seqno completions. Thus all the complex code
> for coping with such is no longer required and can be removed.
>
> Fixed a bug in scheduler bypass mode introduced in the clean up code
> refactoring of v5. The clean up function was seeing the node in the
> wrong state and thus refusing to process it.
>
> For: VIZ-1587
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> ---
> drivers/gpu/drm/i915/Makefile | 1 +
> drivers/gpu/drm/i915/i915_dma.c | 3 +
> drivers/gpu/drm/i915/i915_drv.h | 6 +
> drivers/gpu/drm/i915/i915_gem.c | 5 +
> drivers/gpu/drm/i915/i915_scheduler.c | 867 ++++++++++++++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_scheduler.h | 113 +++++
> 6 files changed, 995 insertions(+)
> create mode 100644 drivers/gpu/drm/i915/i915_scheduler.c
> create mode 100644 drivers/gpu/drm/i915/i915_scheduler.h
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index e9cdeb5..289fa73 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -10,6 +10,7 @@ ccflags-y := -Werror
> i915-y := i915_drv.o \
> i915_irq.o \
> i915_params.o \
> + i915_scheduler.o \
> i915_suspend.o \
> i915_sysfs.o \
> intel_csr.o \
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index b377753..2ad4071 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -37,6 +37,7 @@
> #include "i915_drv.h"
> #include "i915_vgpu.h"
> #include "i915_trace.h"
> +#include "i915_scheduler.h"
> #include <linux/pci.h>
> #include <linux/console.h>
> #include <linux/vt.h>
> @@ -1448,6 +1449,8 @@ int i915_driver_unload(struct drm_device *dev)
>
> intel_csr_ucode_fini(dev_priv);
>
> + i915_scheduler_destroy(dev_priv);
> +
> /* Free error state after interrupts are fully disabled. */
> cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
> i915_destroy_error_state(dev);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 7492ce7..7b62e2c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1717,6 +1717,8 @@ struct i915_execbuffer_params {
> struct drm_i915_gem_request *request;
> };
>
> +struct i915_scheduler;
> +
> /* used in computing the new watermarks state */
> struct intel_wm_config {
> unsigned int num_pipes_active;
> @@ -1994,6 +1996,8 @@ struct drm_i915_private {
>
> struct i915_runtime_pm pm;
>
> + struct i915_scheduler *scheduler;
> +
> /* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
> struct {
> int (*execbuf_submit)(struct i915_execbuffer_params *params,
> @@ -2335,6 +2339,8 @@ struct drm_i915_gem_request {
> /** process identifier submitting this request */
> struct pid *pid;
>
> + struct i915_scheduler_queue_entry *scheduler_qe;
> +
> /**
> * The ELSP only accepts two elements at a time, so we queue
> * context/tail pairs on a given queue (ring->execlist_queue) until the
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index a632276..b7466cb 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -32,6 +32,7 @@
> #include "i915_vgpu.h"
> #include "i915_trace.h"
> #include "intel_drv.h"
> +#include "i915_scheduler.h"
> #include <linux/shmem_fs.h>
> #include <linux/slab.h>
> #include <linux/swap.h>
> @@ -5405,6 +5406,10 @@ int i915_gem_init(struct drm_device *dev)
> */
> intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> + ret = i915_scheduler_init(dev);
> + if (ret)
> + goto out_unlock;
> +
> ret = i915_gem_init_userptr(dev);
> if (ret)
> goto out_unlock;
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> new file mode 100644
> index 0000000..9d628b9
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> @@ -0,0 +1,867 @@
> +/*
> + * Copyright (c) 2014 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_drv.h"
> +#include "i915_scheduler.h"
> +
> +#define for_each_scheduler_node(node, id) \
> + list_for_each_entry((node), &scheduler->node_queue[(id)], link)
> +
> +#define assert_scheduler_lock_held(scheduler) \
> + do { \
> + WARN_ONCE(!spin_is_locked(&(scheduler)->lock), "Spinlock not locked!"); \
> + } while(0)
> +
> +/**
> + * i915_scheduler_is_enabled - Returns true if the scheduler is enabled.
> + * @dev: DRM device
> + */
> +bool i915_scheduler_is_enabled(struct drm_device *dev)
> +{
> + struct drm_i915_private *dev_priv = to_i915(dev);
> +
> + return dev_priv->scheduler != NULL;
> +}
> +
> +/**
> + * i915_scheduler_init - Initialise the scheduler.
> + * @dev: DRM device
> + * Returns zero on success or -ENOMEM if memory allocations fail.
> + */
> +int i915_scheduler_init(struct drm_device *dev)
> +{
> + struct drm_i915_private *dev_priv = to_i915(dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + int e;
> +
> + if (scheduler)
> + return 0;
Probably a GEM_BUG_ON or something.
> +
> + scheduler = kzalloc(sizeof(*scheduler), GFP_KERNEL);
> + if (!scheduler)
> + return -ENOMEM;
> +
> + spin_lock_init(&scheduler->lock);
> +
> + for (e = 0; e < I915_NUM_ENGINES; e++) {
> + INIT_LIST_HEAD(&scheduler->node_queue[e]);
> + scheduler->counts[e].flying = 0;
> + scheduler->counts[e].queued = 0;
> + }
> +
> + /* Default tuning values: */
> + scheduler->priority_level_min = -1023;
> + scheduler->priority_level_max = 1023;
> + scheduler->priority_level_bump = 50;
> + scheduler->priority_level_preempt = 900;
> + scheduler->min_flying = 2;
> +
> + dev_priv->scheduler = scheduler;
> +
> + return 0;
> +}
> +
> +/**
> + * i915_scheduler_destroy - Get rid of the scheduler.
> + * @dev_priv: i915 device private
> + */
> +void i915_scheduler_destroy(struct drm_i915_private *dev_priv)
> +{
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + int e;
> +
> + if (!scheduler)
> + return;
> +
> + for (e = 0; e < I915_NUM_ENGINES; e++)
> + WARN(!list_empty(&scheduler->node_queue[e]), "Destroying with list entries on engine %d!", e);
> +
> + kfree(scheduler);
> + dev_priv->scheduler = NULL;
> +}
> +
> +/*
> + * Add a popped node back in to the queue. For example, because the engine
> + * was hung when execfinal() was called and thus the engine submission needs
> + * to be retried later.
> + */
> +static void i915_scheduler_node_requeue(struct i915_scheduler *scheduler,
> + struct i915_scheduler_queue_entry *node)
> +{
> + assert_scheduler_lock_held(scheduler);
> +
> + WARN_ON(!I915_SQS_IS_FLYING(node));
> +
> + /* Seqno will be reassigned on relaunch */
> + node->params.request->seqno = 0;
> + node->status = I915_SQS_QUEUED;
> + scheduler->counts[node->params.engine->id].flying--;
> + scheduler->counts[node->params.engine->id].queued++;
> +}
> +
> +/*
> + * Give up on a node completely. For example, because it is causing the
> + * engine to hang or is using some resource that no longer exists.
> + */
> +static void i915_scheduler_node_kill(struct i915_scheduler *scheduler,
> + struct i915_scheduler_queue_entry *node)
> +{
> + assert_scheduler_lock_held(scheduler);
> +
> + WARN_ON(I915_SQS_IS_COMPLETE(node));
> +
> + if (I915_SQS_IS_FLYING(node))
> + scheduler->counts[node->params.engine->id].flying--;
> + else
> + scheduler->counts[node->params.engine->id].queued--;
> +
> + node->status = I915_SQS_DEAD;
> +}
> +
> +/* Mark a node as in flight on the hardware. */
> +static void i915_scheduler_node_fly(struct i915_scheduler_queue_entry *node)
> +{
> + struct drm_i915_private *dev_priv = to_i915(node->params.dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct intel_engine_cs *engine = node->params.engine;
> +
> + assert_scheduler_lock_held(scheduler);
> +
> + WARN_ON(node->status != I915_SQS_POPPED);
> +
> + /*
> + * Add the node (which should currently be in state popped) to the
> + * front of the queue. This ensures that flying nodes are always held
> + * in hardware submission order.
> + */
> + list_add(&node->link, &scheduler->node_queue[engine->id]);
> +
> + node->status = I915_SQS_FLYING;
> +
> + scheduler->counts[engine->id].flying++;
> +
> + if (!(scheduler->flags[engine->id] & I915_SF_INTERRUPTS_ENABLED)) {
> + bool success = true;
> +
> + success = engine->irq_get(engine);
> + if (success)
> + scheduler->flags[engine->id] |= I915_SF_INTERRUPTS_ENABLED;
> + }
> +}
> +
> +static inline uint32_t i915_scheduler_count_flying(struct i915_scheduler *scheduler,
> + struct intel_engine_cs *engine)
> +{
> + return scheduler->counts[engine->id].flying;
> +}
Maybe num_flying would be a more obvious name, or at least less
worry-inducing.
> +
> +static void i915_scheduler_priority_bump_clear(struct i915_scheduler *scheduler)
> +{
> + struct i915_scheduler_queue_entry *node;
> + int i;
> +
> + assert_scheduler_lock_held(scheduler);
> +
> + /*
> + * Ensure circular dependencies don't cause problems and that a bump
> + * by object usage only bumps each using buffer once:
> + */
> + for (i = 0; i < I915_NUM_ENGINES; i++) {
> + for_each_scheduler_node(node, i)
> + node->bumped = false;
> + }
> +}
> +
> +static int i915_scheduler_priority_bump(struct i915_scheduler *scheduler,
> + struct i915_scheduler_queue_entry *target,
> + uint32_t bump)
> +{
> + uint32_t new_priority;
> + int i, count;
> +
> + if (target->priority >= scheduler->priority_level_max)
> + return 1;
> +
> + if (target->bumped)
> + return 0;
> +
> + new_priority = target->priority + bump;
> + if ((new_priority <= target->priority) ||
> + (new_priority > scheduler->priority_level_max))
> + target->priority = scheduler->priority_level_max;
> + else
> + target->priority = new_priority;
> +
> + count = 1;
> + target->bumped = true;
> +
> + for (i = 0; i < target->num_deps; i++) {
> + if (!target->dep_list[i])
> + continue;
> +
> + if (target->dep_list[i]->bumped)
> + continue;
> +
> + count += i915_scheduler_priority_bump(scheduler,
> + target->dep_list[i],
> + bump);
> + }
> +
> + return count;
> +}
> +
> +/*
> + * Nodes are considered valid dependencies if they are queued on any engine
> + * or if they are in flight on a different engine. In flight on the same
> + * engine is no longer interesting for non-pre-emptive nodes as the engine
> + * serialises execution. For pre-empting nodes, all in flight dependencies
> + * are valid as they must not be jumped by the act of pre-empting.
> + *
> + * Anything that is neither queued nor flying is uninteresting.
> + */
> +static inline bool i915_scheduler_is_dependency_valid(
> + struct i915_scheduler_queue_entry *node, uint32_t idx)
> +{
> + struct i915_scheduler_queue_entry *dep;
> +
> + dep = node->dep_list[idx];
> + if (!dep)
> + return false;
> +
> + if (I915_SQS_IS_QUEUED(dep))
> + return true;
> +
> + if (I915_SQS_IS_FLYING(dep)) {
> + if (node->params.engine != dep->params.engine)
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static int i915_scheduler_pop_from_queue_locked(struct intel_engine_cs *engine,
> + struct i915_scheduler_queue_entry **pop_node)
> +{
> + struct drm_i915_private *dev_priv = to_i915(engine->dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct i915_scheduler_queue_entry *best = NULL;
> + struct i915_scheduler_queue_entry *node;
> + int ret;
> + int i;
> + bool any_queued = false;
> + bool has_local, has_remote, only_remote = false;
> +
> + assert_scheduler_lock_held(scheduler);
> +
> + *pop_node = NULL;
Looks like it's not needed since there is *pop_node = best below.
> + ret = -ENODATA;
> +
> + for_each_scheduler_node(node, engine->id) {
> + if (!I915_SQS_IS_QUEUED(node))
> + continue;
> + any_queued = true;
> +
> + has_local = false;
> + has_remote = false;
> + for (i = 0; i < node->num_deps; i++) {
> + if (!i915_scheduler_is_dependency_valid(node, i))
> + continue;
> +
> + if (node->dep_list[i]->params.engine == node->params.engine)
> + has_local = true;
> + else
> + has_remote = true;
Would it be worth breaking early from this loop if has_local && has_remote?
> + }
> +
> + if (has_remote && !has_local)
> + only_remote = true;
> +
> + if (!has_local && !has_remote) {
> + if (!best ||
> + (node->priority > best->priority))
> + best = node;
> + }
> + }
> +
> + if (best) {
> + list_del(&best->link);
> +
> + INIT_LIST_HEAD(&best->link);
list_del_init can replace the two lines above.
> + best->status = I915_SQS_POPPED;
> +
> + scheduler->counts[engine->id].queued--;
> +
> + ret = 0;
> + } else {
> + /* Can only get here if:
> + * (a) there are no buffers in the queue
> + * (b) all queued buffers are dependent on other buffers
> + * e.g. on a buffer that is in flight on a different engine
> + */
> + if (only_remote) {
> + /* The only dependent buffers are on another engine. */
> + ret = -EAGAIN;
> + } else if (any_queued) {
> + /* It seems that something has gone horribly wrong! */
> + WARN_ONCE(true, "Broken dependency tracking on engine %d!\n",
> + (int) engine->id);
> + }
> + }
> +
> + *pop_node = best;
> + return ret;
> +}
> +
> +/*
> + * NB: The driver mutex lock must be held before calling this function. It is
> + * only really required during the actual back end submission call. However,
> + * attempting to acquire a mutex while holding a spin lock is a Bad Idea.
> + * And releasing the one before acquiring the other leads to other code
> + * being run and interfering.
> + */
> +static int i915_scheduler_submit(struct intel_engine_cs *engine)
> +{
> + struct drm_i915_private *dev_priv = to_i915(engine->dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct i915_scheduler_queue_entry *node;
> + int ret, count = 0, flying;
> +
> + WARN_ON(!mutex_is_locked(&engine->dev->struct_mutex));
> +
> + spin_lock_irq(&scheduler->lock);
> +
> + WARN_ON(scheduler->flags[engine->id] & I915_SF_SUBMITTING);
> + scheduler->flags[engine->id] |= I915_SF_SUBMITTING;
> +
> + /* First time around, complain if anything unexpected occurs: */
> + ret = i915_scheduler_pop_from_queue_locked(engine, &node);
> + if (ret)
> + goto error;
> +
> + do {
> + WARN_ON(node->params.engine != engine);
> + WARN_ON(node->status != I915_SQS_POPPED);
> + count++;
> +
> + /*
> + * The call to pop above will have removed the node from the
> + * list. So add it back in and mark it as in flight.
> + */
> + i915_scheduler_node_fly(node);
> +
> + spin_unlock_irq(&scheduler->lock);
> + ret = dev_priv->gt.execbuf_final(&node->params);
> + spin_lock_irq(&scheduler->lock);
> +
> + /*
> + * Handle failed submission but first check that the
> + * watchdog/reset code has not nuked the node while we
> + * weren't looking:
> + */
> + if (ret && (node->status != I915_SQS_DEAD)) {
> + bool requeue = true;
> +
> + /*
> + * Oh dear! Either the node is broken or the engine is
> + * busy. So need to kill the node or requeue it and try
> + * again later as appropriate.
> + */
> +
> + switch (-ret) {
> + case ENODEV:
> + case ENOENT:
> + /* Fatal errors. Kill the node. */
> + requeue = false;
> + i915_scheduler_node_kill(scheduler, node);
> + break;
> +
> + case EAGAIN:
> + case EBUSY:
> + case EIO:
> + case ENOMEM:
> + case ERESTARTSYS:
> + case EINTR:
> + /* Supposedly recoverable errors. */
> + break;
> +
> + default:
> + /*
> + * Assume the error is recoverable and hope
> + * for the best.
> + */
> + MISSING_CASE(-ret);
> + break;
> + }
> +
> + if (requeue) {
> + i915_scheduler_node_requeue(scheduler, node);
> + /*
> + * No point spinning if the engine is currently
> + * unavailable so just give up and come back
> + * later.
> + */
> + break;
> + }
> + }
> +
> + /* Keep launching until the sky is sufficiently full. */
> + flying = i915_scheduler_count_flying(scheduler, engine);
> + if (flying >= scheduler->min_flying)
> + break;
> +
> + /* Grab another node and go round again... */
> + ret = i915_scheduler_pop_from_queue_locked(engine, &node);
> + } while (ret == 0);
> +
> + /* Don't complain about not being able to submit extra entries */
> + if (ret == -ENODATA)
> + ret = 0;
> +
> + /*
> + * Bump the priority of everything that was not submitted to prevent
> + * starvation of low priority tasks by a spamming high priority task.
> + */
> + i915_scheduler_priority_bump_clear(scheduler);
> + for_each_scheduler_node(node, engine->id) {
> + if (!I915_SQS_IS_QUEUED(node))
> + continue;
> +
> + i915_scheduler_priority_bump(scheduler, node,
> + scheduler->priority_level_bump);
> + }
bump_clear will iterate the queues for all engines and then you iterate the
current engine's queue again. Would this be equivalent:
	for (i = 0; i < I915_NUM_ENGINES; i++) {
		for_each_scheduler_node(node, i) {
			node->bumped = false;
			if (i == engine->id && I915_SQS_IS_QUEUED(node))
				i915_scheduler_priority_bump(scheduler, node,
							     scheduler->priority_level_bump);
		}
	}
Advantage is one loop fewer.
> +
> + /* On success, return the number of buffers submitted. */
> + if (ret == 0)
> + ret = count;
> +
> +error:
> + scheduler->flags[engine->id] &= ~I915_SF_SUBMITTING;
> + spin_unlock_irq(&scheduler->lock);
> + return ret;
> +}
> +
> +static void i915_generate_dependencies(struct i915_scheduler *scheduler,
> + struct i915_scheduler_queue_entry *node,
> + uint32_t engine)
> +{
> + struct i915_scheduler_obj_entry *this, *that;
> + struct i915_scheduler_queue_entry *test;
> + int i, j;
> + bool found;
> +
> + for_each_scheduler_node(test, engine) {
> + if (I915_SQS_IS_COMPLETE(test))
> + continue;
> +
> + /*
> + * Batches on the same engine for the same
> + * context must be kept in order.
> + */
> + found = (node->params.ctx == test->params.ctx) &&
> + (node->params.engine == test->params.engine);
> +
> + /*
> + * Batches working on the same objects must
> + * be kept in order.
> + */
> + for (i = 0; (i < node->num_objs) && !found; i++) {
> + this = node->objs + i;
> +
> + for (j = 0; j < test->num_objs; j++) {
> + that = test->objs + j;
> +
> + if (this->obj != that->obj)
> + continue;
> +
> + /* Only need to worry about writes */
> + if (this->read_only && that->read_only)
> + continue;
> +
> + found = true;
> + break;
> + }
> + }
> +
> + if (found) {
> + node->dep_list[node->num_deps] = test;
> + node->num_deps++;
> + }
> + }
> +}
> +
> +static int i915_scheduler_queue_execbuffer_bypass(struct i915_scheduler_queue_entry *qe)
> +{
> + struct drm_i915_private *dev_priv = to_i915(qe->params.dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + int ret;
> +
> + scheduler->flags[qe->params.engine->id] |= I915_SF_SUBMITTING;
> + ret = dev_priv->gt.execbuf_final(&qe->params);
> + scheduler->flags[qe->params.engine->id] &= ~I915_SF_SUBMITTING;
> +
> + /*
> + * Don't do any clean up on failure because the caller will
> + * do it all anyway.
> + */
> + if (ret)
> + return ret;
> +
> + /* Need to release any resources held by the node: */
> + qe->status = I915_SQS_COMPLETE;
> + i915_scheduler_clean_node(qe);
> +
> + return 0;
> +}
> +
> +static inline uint32_t i915_scheduler_count_incomplete(struct i915_scheduler *scheduler)
> +{
> + int e, incomplete = 0;
> +
> + for (e = 0; e < I915_NUM_ENGINES; e++)
> + incomplete += scheduler->counts[e].queued + scheduler->counts[e].flying;
> +
> + return incomplete;
> +}
> +
> +/**
> + * i915_scheduler_queue_execbuffer - Submit a batch buffer request to the
> + * scheduler.
> + * @qe: The batch buffer request to be queued.
> + * The expectation is the qe passed in is a local stack variable. This
> + * function will copy its contents into a freshly allocated list node. The
> + * new node takes ownership of said contents so the original qe should simply
> + * be discarded and not cleaned up (i.e. don't free memory it points to or
> + * dereference objects it holds). The node is added to the scheduler's queue
> + * and the batch buffer will be submitted to the hardware at some future
> + * point in time (which may be immediately, before returning or may be quite
> + * a lot later).
> + */
> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe)
> +{
> + struct drm_i915_private *dev_priv = to_i915(qe->params.dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct intel_engine_cs *engine = qe->params.engine;
> + struct i915_scheduler_queue_entry *node;
> + bool not_flying;
> + int i, e;
> + int incomplete;
> +
> + /* Bypass the scheduler and send the buffer immediately? */
> + if (1/*!i915.enable_scheduler*/)
> + return i915_scheduler_queue_execbuffer_bypass(qe);
> +
> + node = kmalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> +
> + *node = *qe;
> + INIT_LIST_HEAD(&node->link);
> + node->status = I915_SQS_QUEUED;
> + node->stamp = jiffies;
> + i915_gem_request_reference(node->params.request);
> +
> + WARN_ON(node->params.request->scheduler_qe);
> + node->params.request->scheduler_qe = node;
> +
> + /*
> + * Need to determine the number of incomplete entries in the list as
> + * that will be the maximum size of the dependency list.
> + *
> + * Note that the allocation must not be made with the spinlock acquired
> + * as kmalloc can sleep. However, the unlock/relock is safe because no
> + * new entries can be queued up during the unlock as the i915 driver
> + * mutex is still held. Entries could be removed from the list but that
> + * just means the dep_list will be over-allocated which is fine.
> + */
> + spin_lock_irq(&scheduler->lock);
> + incomplete = i915_scheduler_count_incomplete(scheduler);
> +
> + /* Temporarily unlock to allocate memory: */
> + spin_unlock_irq(&scheduler->lock);
> + if (incomplete) {
> + node->dep_list = kmalloc_array(incomplete,
> + sizeof(*node->dep_list),
> + GFP_KERNEL);
> + if (!node->dep_list) {
> + kfree(node);
> + return -ENOMEM;
> + }
> + } else
> + node->dep_list = NULL;
> +
> + spin_lock_irq(&scheduler->lock);
> + node->num_deps = 0;
> +
> + if (node->dep_list) {
> + for (e = 0; e < I915_NUM_ENGINES; e++)
> + i915_generate_dependencies(scheduler, node, e);
> +
> + WARN_ON(node->num_deps > incomplete);
> + }
> +
> + node->priority = clamp(node->priority,
> + scheduler->priority_level_min,
> + scheduler->priority_level_max);
> +
> + if ((node->priority > 0) && node->num_deps) {
> + i915_scheduler_priority_bump_clear(scheduler);
> +
> + for (i = 0; i < node->num_deps; i++)
> + i915_scheduler_priority_bump(scheduler,
> + node->dep_list[i], node->priority);
> + }
> +
> + list_add_tail(&node->link, &scheduler->node_queue[engine->id]);
> +
> + not_flying = i915_scheduler_count_flying(scheduler, engine) <
> + scheduler->min_flying;
> +
> + scheduler->counts[engine->id].queued++;
> +
> + spin_unlock_irq(&scheduler->lock);
> +
> + if (not_flying)
> + i915_scheduler_submit(engine);
> +
> + return 0;
> +}
> +
> +/**
> + * i915_scheduler_notify_request - Notify the scheduler that the given
> + * request has completed on the hardware.
> + * @req: Request structure which has completed
> + * A sequence number has popped out of the hardware and the request handling
> + * code has mapped it back to a request and will mark that request complete.
> + * It also calls this function to notify the scheduler about the completion
> + * so the scheduler's node can be updated appropriately.
> + * Returns true if the request is scheduler managed, false if not. The return
> + * value is combined for all freshly completed requests and if any were true
> + * then i915_scheduler_wakeup() is called so the scheduler can do further
> + * processing (submit more work) at the end.
> + */
> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req)
> +{
> + struct drm_i915_private *dev_priv = to_i915(req->engine->dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct i915_scheduler_queue_entry *node = req->scheduler_qe;
> + unsigned long flags;
> +
> + if (!node)
> + return false;
> +
> + spin_lock_irqsave(&scheduler->lock, flags);
> +
> + WARN_ON(!I915_SQS_IS_FLYING(node));
> +
> + /* Node was in flight so mark it as complete. */
> + if (req->cancelled)
> + node->status = I915_SQS_DEAD;
> + else
> + node->status = I915_SQS_COMPLETE;
> +
> + scheduler->counts[req->engine->id].flying--;
> +
> + spin_unlock_irqrestore(&scheduler->lock, flags);
> +
> + return true;
> +}
> +
> +static int i915_scheduler_remove_dependent(struct i915_scheduler *scheduler,
> + struct i915_scheduler_queue_entry *remove)
> +{
> + struct i915_scheduler_queue_entry *node;
> + int i, r;
> + int count = 0;
> +
> + /*
> + * Ensure that a node is not being removed which is still dependent
> + * upon other (not completed) work. If that happens, it implies
> + * something has gone very wrong with the dependency tracking! Note
> + * that there is no need to worry if this node has been explicitly
> + * killed for some reason - it might be being killed before it got
> + * sent to the hardware.
> + */
> + if (remove->status != I915_SQS_DEAD) {
> + for (i = 0; i < remove->num_deps; i++)
> + if ((remove->dep_list[i]) &&
> + (!I915_SQS_IS_COMPLETE(remove->dep_list[i])))
> + count++;
> + WARN_ON(count);
> + }
> +
> + /*
> + * Remove this node from the dependency lists of any other node which
> + * might be waiting on it.
> + */
> + for (r = 0; r < I915_NUM_ENGINES; r++) {
> + for_each_scheduler_node(node, r) {
> + for (i = 0; i < node->num_deps; i++) {
> + if (node->dep_list[i] != remove)
> + continue;
> +
> + node->dep_list[i] = NULL;
Can the same node be listed in another node's dependency list multiple
times? If not, you could break after clearing it and not iterate the rest
of the list.
> + }
> + }
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * i915_scheduler_wakeup - wake the scheduler's worker thread
> + * @dev: DRM device
> + * Called at the end of seqno interrupt processing if any request has
> + * completed that corresponds to a scheduler node.
> + */
> +void i915_scheduler_wakeup(struct drm_device *dev)
> +{
> + /* XXX: Need to call i915_scheduler_remove() via work handler. */
> +}
> +
> +/**
> + * i915_scheduler_clean_node - free up any allocations/references
> + * associated with the given scheduler queue entry.
> + * @node: Queue entry structure which is complete
> + * After a given batch buffer completes on the hardware, all the information
> + * required to resubmit it is no longer required. However, the node entry
> + * itself might still be required for tracking purposes for a while longer.
> + * This function should be called as soon as the node is known to be complete
> + * so that these resources may be freed even though the node itself might
> + * hang around.
> + */
> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node)
> +{
> + if (!I915_SQS_IS_COMPLETE(node)) {
> + WARN(!node->params.request->cancelled,
> + "Cleaning active node: %d!\n", node->status);
> + return;
> + }
> +
> + if (node->params.batch_obj) {
> + /*
> + * The batch buffer must be unpinned before it is unreferenced
> + * otherwise the unpin fails with a missing vma!?
> + */
> + if (node->params.dispatch_flags & I915_DISPATCH_SECURE)
> + i915_gem_execbuf_release_batch_obj(node->params.batch_obj);
> +
> + node->params.batch_obj = NULL;
> + }
> +
> + /* And anything else owned by the node: */
> + if (node->params.cliprects) {
> + kfree(node->params.cliprects);
> + node->params.cliprects = NULL;
> + }
> +}
> +
> +static bool i915_scheduler_remove(struct i915_scheduler *scheduler,
> + struct intel_engine_cs *engine,
> + struct list_head *remove)
> +{
> + struct i915_scheduler_queue_entry *node, *node_next;
> + bool do_submit;
> +
> + spin_lock_irq(&scheduler->lock);
> +
> + INIT_LIST_HEAD(remove);
> + list_for_each_entry_safe(node, node_next, &scheduler->node_queue[engine->id], link) {
> + if (!I915_SQS_IS_COMPLETE(node))
> + break;
> +
> + list_del(&node->link);
> + list_add(&node->link, remove);
list_move
> +
> + /* Strip the dependency info while the mutex is still locked */
> + i915_scheduler_remove_dependent(scheduler, node);
> +
> + continue;
Could kill the continue. :)
> + }
> +
> + /*
> + * Release the interrupt reference count if there are no longer any
> + * nodes to worry about.
> + */
> + if (list_empty(&scheduler->node_queue[engine->id]) &&
> + (scheduler->flags[engine->id] & I915_SF_INTERRUPTS_ENABLED)) {
> + engine->irq_put(engine);
> + scheduler->flags[engine->id] &= ~I915_SF_INTERRUPTS_ENABLED;
> + }
> +
> + /* Launch more packets now? */
> + do_submit = (scheduler->counts[engine->id].queued > 0) &&
> + (scheduler->counts[engine->id].flying < scheduler->min_flying);
> +
> + spin_unlock_irq(&scheduler->lock);
> +
> + return do_submit;
> +}
> +
> +void i915_scheduler_process_work(struct intel_engine_cs *engine)
> +{
> + struct drm_i915_private *dev_priv = to_i915(engine->dev);
> + struct i915_scheduler *scheduler = dev_priv->scheduler;
> + struct i915_scheduler_queue_entry *node;
> + bool do_submit;
> + struct list_head remove;
Could do LIST_HEAD(remove); to declare an empty list and then you wouldn't
have to do INIT_LIST_HEAD in i915_scheduler_remove. Would probably be
better to group the logic together like that.
> +
> + if (list_empty(&scheduler->node_queue[engine->id]))
> + return;
> +
> + /* Remove completed nodes. */
> + do_submit = i915_scheduler_remove(scheduler, engine, &remove);
> +
> + if (!do_submit && list_empty(&remove))
> + return;
> +
> + /* Need to grab the pm lock outside of the mutex lock */
> + if (do_submit)
> + intel_runtime_pm_get(dev_priv);
> +
> + mutex_lock(&engine->dev->struct_mutex);
> +
> + if (do_submit)
> + i915_scheduler_submit(engine);
> +
> + while (!list_empty(&remove)) {
> + node = list_first_entry(&remove, typeof(*node), link);
> + list_del(&node->link);
> +
> + /* Free up all the DRM references */
> + i915_scheduler_clean_node(node);
> +
> + /* And anything else owned by the node: */
> + node->params.request->scheduler_qe = NULL;
> + i915_gem_request_unreference(node->params.request);
> + kfree(node->dep_list);
> + kfree(node);
> + }
> +
> + mutex_unlock(&engine->dev->struct_mutex);
> +
> + if (do_submit)
> + intel_runtime_pm_put(dev_priv);
> +}
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
> new file mode 100644
> index 0000000..c895c4c
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_scheduler.h
> @@ -0,0 +1,113 @@
> +/*
> + * Copyright (c) 2014 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef _I915_SCHEDULER_H_
> +#define _I915_SCHEDULER_H_
> +
> +enum i915_scheduler_queue_status {
> + /* Limbo: */
> + I915_SQS_NONE = 0,
> + /* Not yet submitted to hardware: */
> + I915_SQS_QUEUED,
> + /* Popped from queue, ready to fly: */
> + I915_SQS_POPPED,
> + /* Sent to hardware for processing: */
> + I915_SQS_FLYING,
> + /* Finished processing on the hardware: */
> + I915_SQS_COMPLETE,
> + /* Killed by watchdog or catastrophic submission failure: */
> + I915_SQS_DEAD,
> + /* Limit value for use with arrays/loops */
> + I915_SQS_MAX
> +};
> +
> +#define I915_SQS_IS_QUEUED(node) (((node)->status == I915_SQS_QUEUED))
> +#define I915_SQS_IS_FLYING(node) (((node)->status == I915_SQS_FLYING))
> +#define I915_SQS_IS_COMPLETE(node) (((node)->status == I915_SQS_COMPLETE) || \
> + ((node)->status == I915_SQS_DEAD))
> +
> +struct i915_scheduler_obj_entry {
> + struct drm_i915_gem_object *obj;
> + bool read_only;
> +};
> +
> +struct i915_scheduler_queue_entry {
> + /* Any information required to submit this batch buffer to the hardware */
> + struct i915_execbuffer_params params;
> +
> + /* -1023 = lowest priority, 0 = default, 1023 = highest */
> + int32_t priority;
> + bool bumped;
> +
> + /* Objects referenced by this batch buffer */
> + struct i915_scheduler_obj_entry *objs;
> + int num_objs;
> +
> + /* Batch buffers this one is dependent upon */
> + struct i915_scheduler_queue_entry **dep_list;
> + int num_deps;
> +
> + enum i915_scheduler_queue_status status;
> + unsigned long stamp;
> +
> + /* List of all scheduler queue entry nodes */
> + struct list_head link;
> +};
> +
> +struct i915_scheduler_node_states {
> + uint32_t flying;
> + uint32_t queued;
> +};
> +
> +struct i915_scheduler {
> + struct list_head node_queue[I915_NUM_ENGINES];
> + uint32_t flags[I915_NUM_ENGINES];
> + spinlock_t lock;
> +
> + /* Node counts: */
> + struct i915_scheduler_node_states counts[I915_NUM_ENGINES];
> +
> + /* Tuning parameters: */
> + int32_t priority_level_min;
> + int32_t priority_level_max;
> + int32_t priority_level_bump;
> + int32_t priority_level_preempt;
> + uint32_t min_flying;
> +};
> +
> +/* Flag bits for i915_scheduler::flags */
> +enum {
> + I915_SF_INTERRUPTS_ENABLED = (1 << 0),
> + I915_SF_SUBMITTING = (1 << 1),
> +};
> +
> +bool i915_scheduler_is_enabled(struct drm_device *dev);
> +int i915_scheduler_init(struct drm_device *dev);
> +void i915_scheduler_destroy(struct drm_i915_private *dev_priv);
> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node);
> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe);
> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req);
> +void i915_scheduler_wakeup(struct drm_device *dev);
> +
> +#endif /* _I915_SCHEDULER_H_ */
>
Regards,
Tvrtko