[Intel-gfx] [PATCH v5 06/35] drm/i915: Start of GPU scheduler

Fri Feb 19 13:03:01 UTC 2016

Hi,

Now the code is in reviewable chunks, excellent!

I've added my comments below. A few repeats from last round, but now
with more questions about the logic itself.

On to, 2016-02-18 at 14:26 +0000, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
> 
> Initial creation of scheduler source files. Note that this patch
> implements most of the scheduler functionality but does not hook it in
> to the driver yet. It also leaves the scheduler code in 'pass through'
> mode so that even when it is hooked in, it will not actually do very
> much. This allows the hooks to be added one at a time in bite size
> chunks and only when the scheduler is finally enabled at the end does
> anything start happening.
> 
> The general theory of operation is that when batch buffers are
> submitted to the driver, the execbuffer() code packages up all the
> information required to execute the batch buffer at a later time. This
> package is given over to the scheduler which adds it to an internal
> node list. The scheduler also scans the list of objects associated
> with the batch buffer and compares them against the objects already in
> use by other buffers in the node list. If matches are found then the
> new batch buffer node is marked as being dependent upon the matching
> node. The same is done for the context object. The scheduler also
> bumps up the priority of such matching nodes on the grounds that the
> more dependencies a given batch buffer has the more important it is
> likely to be.
> 
> The scheduler aims to have a given (tuneable) number of batch buffers
> in flight on the hardware at any given time. If fewer than this are
> currently executing when a new node is queued, then the node is passed
> straight through to the submit function. Otherwise it is simply added
> to the queue and the driver returns back to user land.
> 
> The scheduler is notified when each batch buffer completes and updates
> its internal tracking accordingly. At the end of the completion
> interrupt processing, if any scheduler tracked batches were processed,
> the scheduler's deferred worker thread is woken up. This can do more
> involved processing such as actually removing completed nodes from the
> queue and freeing up the resources associated with them (internal
> memory allocations, DRM object references, context reference, etc.).
> The work handler also checks the in flight count and calls the
> submission code if a new slot has appeared.
> 
> When the scheduler's submit code is called, it scans the queued node
> list for the highest priority node that has no unmet dependencies.
> Note that the dependency calculation is complex as it must take
> inter-ring dependencies and potential preemptions into account. Note
> also that in the future this will be extended to include external
> dependencies such as the Android Native Sync file descriptors and/or
> the linux dma-buff synchronisation scheme.
> 
> If a suitable node is found then it is sent to execbuff_final() for
> submission to the hardware. The in flight count is then re-checked and
> a new node popped from the list if appropriate. All nodes that are not
> submitted have their priority bumped. This ensures that low priority
> tasks do not get starved out by busy higher priority ones - everything
> will eventually get its turn to run.
> 
> Note that this patch does not implement pre-emptive scheduling. Only
> basic scheduling by re-ordering batch buffer submission is currently
> implemented. Pre-emption of actively executing batch buffers comes in
> the next patch series.
> 
> v2: Changed priority levels to +/-1023 due to feedback from Chris
> Wilson.
> 
> Removed redundant index from scheduler node.
> 
> Changed time stamps to use jiffies instead of raw monotonic. This
> provides lower resolution but improved compatibility with other i915
> code.
> 
> Major re-write of completion tracking code due to struct fence
> conversion. The scheduler no longer has it's own private IRQ handler
> but just lets the existing request code handle completion events.
> Instead, the scheduler now hooks into the request notify code to be
> told when a request has completed.
> 
> Reduced driver mutex locking scope. Removal of scheduler nodes no
> longer grabs the mutex lock.
> 
> v3: Refactor of dependency generation to make the code more readable.
> Also added in read-read optimisation support - i.e., don't treat a
> shared read-only buffer as being a dependency.
> 
> Allowed the killing of queued nodes rather than only flying ones.
> 
> v4: Updated the commit message to better reflect the current state of
> the code. Downgraded some BUG_ONs to WARN_ONs. Used the correct array
> memory allocator function (kmalloc_array instead of kmalloc).
> Corrected the format of some comments. Wrapped some lines differently
> to keep the style checker happy.
> 
> Fixed a WARN_ON when killing nodes. The dependency removal code checks
> that nodes being destroyed do not have any oustanding dependencies
> (which would imply they should not have been executed yet). In the
> case of nodes being destroyed, e.g. due to context banning, then this
> might well be the case - they have not been executed and do indeed
> have outstanding dependencies.
> 
> Re-instated the code to disble interrupts when not in use. The
> underlying problem causing broken IRQ reference counts seems to have
> been fixed now.
> 
> v5: Shuffled various functions around to remove forward declarations
> as apparently these are frowned upon. Removed lots of white space as
> apparently having easy to read code is also frowned upon. Split the
> direct submission scheduler bypass code out into a separate function.
> Squashed down the i915_scheduler.c sections of various patches into
> this patch. Thus the later patches simply hook in existing code into
> various parts of the driver rather than adding the code as well. Added
> documentation to various functions. Re-worked the submit function in
> terms of mutex locking, error handling and exit paths. Split the
> delayed work handler function in half. Made use of the kernel 'clamp'
> macro. [Joonas Lahtinen]
> 
> Added runtime PM calls as these must be done at the top level before
> acquiring the driver mutex lock. [Chris Wilson]
> 
> Removed some obsolete debug code that had been forgotten about.
> 
> Moved more clean up code into the 'i915_gem_scheduler_clean_node()'
> function rather than replicating it in mutliple places.
> 
> Used lighter weight spinlocks.
> 
> For: VIZ-1587
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile         |   1 +
>  drivers/gpu/drm/i915/i915_drv.h       |   6 +
>  drivers/gpu/drm/i915/i915_gem.c       |   5 +
>  drivers/gpu/drm/i915/i915_scheduler.c | 874 ++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_scheduler.h |  95 ++++
>  5 files changed, 981 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/i915_scheduler.c
>  create mode 100644 drivers/gpu/drm/i915/i915_scheduler.h
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 15398c5..79cb38b 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -10,6 +10,7 @@ ccflags-y := -Werror
>  i915-y := i915_drv.o \
>  	  i915_irq.o \
>  	  i915_params.o \
> +	  i915_scheduler.o \
>            i915_suspend.o \
>  	  i915_sysfs.o \
>  	  intel_csr.o \
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index f4487b9..03add1a 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1703,6 +1703,8 @@ struct i915_execbuffer_params {
>  	struct drm_i915_gem_request     *request;
>  };
>  
> +struct i915_scheduler;
> +
>  /* used in computing the new watermarks state */
>  struct intel_wm_config {
>  	unsigned int num_pipes_active;
> @@ -1968,6 +1970,8 @@ struct drm_i915_private {
>  
>  	struct i915_runtime_pm pm;
>  
> +	struct i915_scheduler *scheduler;
> +

I think we should have i915.enable_scheduler parameter (just like
i915.enable_execlists) and make this a data member, not pointer.

Then we can go forward and have i915.enable_scheduler_preempt and so on
to control things at boot time.

>  	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
>  	struct {
>  		int (*execbuf_submit)(struct i915_execbuffer_params *params,
> @@ -2290,6 +2294,8 @@ struct drm_i915_gem_request {
>  	/** process identifier submitting this request */
>  	struct pid *pid;
>  
> +	struct i915_scheduler_queue_entry	*scheduler_qe;
> +

Funny whitespace.

>  	/**
>  	 * The ELSP only accepts two elements at a time, so we queue
>  	 * context/tail pairs on a given queue (ring->execlist_queue) until the
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index dfe43ea..7d9aa24 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -32,6 +32,7 @@
>  #include "i915_vgpu.h"
>  #include "i915_trace.h"
>  #include "intel_drv.h"
> +#include "i915_scheduler.h"
>  #include 
>  #include 
>  #include 
> @@ -5315,6 +5316,10 @@ int i915_gem_init(struct drm_device *dev)
>  	 */
>  	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>  
> +	ret = i915_scheduler_init(dev);
> +	if (ret)
> +		goto out_unlock;
> +
>  	ret = i915_gem_init_userptr(dev);
>  	if (ret)
>  		goto out_unlock;
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> new file mode 100644
> index 0000000..fc23ee7
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> @@ -0,0 +1,874 @@
> +/*
> + * Copyright (c) 2014 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_drv.h"
> +#include "i915_scheduler.h"
> +
> +/**
> + * i915_scheduler_is_enabled - Returns true if the scheduler is enabled.
> + * @dev: DRM device
> + */
> +bool i915_scheduler_is_enabled(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +
> +	return dev_priv->scheduler != NULL;
> +}

i915.enable_scheduler would be used instead.

> +
> +/**
> + * i915_scheduler_init - Initialise the scheduler.
> + * @dev: DRM device
> + * Returns zero on success or -ENOMEM if memory allocations fail.
> + */
> +int i915_scheduler_init(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	int r;
> +

This protection -->

> +	if (scheduler)
> +		return 0;
> +
> +	scheduler = kzalloc(sizeof(*scheduler), GFP_KERNEL);
> +	if (!scheduler)
> +		return -ENOMEM;
> +

Is not needed if we have data member and i915.enable_scheduler.

> +	spin_lock_init(&scheduler->lock);
> +
> +	for (r = 0; r < I915_NUM_RINGS; r++)
> +		INIT_LIST_HEAD(&scheduler->node_queue[r]);
> +
> +	/* Default tuning values: */
> +	scheduler->priority_level_min     = -1023;
> +	scheduler->priority_level_max     = 1023;
> +	scheduler->priority_level_bump    = 50;
> +	scheduler->priority_level_preempt = 900;
> +	scheduler->min_flying             = 2;
> +
> +	dev_priv->scheduler = scheduler;
> +
> +	return 0;
> +}
> +
> +/*
> + * Add a popped node back in to the queue. For example, because the ring was
> + * hung when execfinal() was called and thus the ring submission needs to be
> + * retried later.
> + */
> +static void i915_scheduler_node_requeue(struct i915_scheduler_queue_entry *node)
> +{
> +	WARN_ON(!node);

Rather remove this completely as it is an internal function, or use
BUG_ON because next line will straight lead to OOPS after warning. I'll
pass mentioning rest of the obvious WARN_ON vs BUG_ON's.

> +	WARN_ON(!I915_SQS_IS_FLYING(node));
> +
> +	/* Seqno will be reassigned on relaunch */
> +	node->params.request->seqno = 0;
> +	node->status = i915_sqs_queued;
> +}
> +
> +/*
> + * Give up on a node completely. For example, because it is causing the
> + * ring to hang or is using some resource that no longer exists.
> + */
> +static void i915_scheduler_node_kill(struct i915_scheduler_queue_entry *node)
> +{
> +	WARN_ON(!node);
> +	WARN_ON(I915_SQS_IS_COMPLETE(node));
> +
> +	node->status = i915_sqs_dead;
> +}
> +
> +/* Mark a node as in flight on the hardware. */

As for documentation, from below, I only assume scheduler lock must be
held and node's can be only manipulated when scheduler lock is held?
It's good to add a comment describing the expected locking. Or maybe
add assert_scheduler_lock_held() helper and sprinkle it around the
functions similar to assert_rpm_wakelock_held(), which is self-
documenting?

> +static int i915_scheduler_node_fly(struct i915_scheduler_queue_entry *node)
> +{
> +	struct drm_i915_private *dev_priv = node->params.dev->dev_private;

If node is NULL, this would already OOPS, I do not think it's
unreasonable to expect node not to be NULL. The below WARN_ON check is
never reached.

> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct intel_engine_cs *ring;
> +
> +	WARN_ON(!scheduler);
> +	WARN_ON(!node);
> +	WARN_ON(node->status != i915_sqs_popped);
> +
> +	ring = node->params.ring;
> +

This and similar assignments can be squashed to declaration line, as
this is basically an alias.

> +	/*
> +	 * Add the node (which should currently be in state popped) to the
> +	 * front of the queue. This ensure that flying nodes are always held
> +	 * in hardware submission order.
> +	 */
> +	list_add(&node->link, &scheduler->node_queue[ring->id]);
> +
> +	node->status = i915_sqs_flying;
> +
> +	if (!(scheduler->flags[ring->id] & i915_sf_interrupts_enabled)) {
> +		bool success = true;
> +
> +		success = ring->irq_get(ring);
> +		if (success)
> +			scheduler->flags[ring->id] |= i915_sf_interrupts_enabled;
> +		else
> +			return -EINVAL;

Shouldn't the node be cleaned up from node_queue in case of an error?

Also, I think above could be written much more compact form (because it
looks like something where more logic will be added later). It makes it
easier to write and visualize the error paths when reading if there are
no nested if's.

I won't mention about the error paths of functions below, I expect the
following guidline to be adhered;

	if (scheduler->flags[ring->id] & I915_SF_INT_ENABLED)
		goto out;

	if (!ring->irq_get(ring))
		goto err_irq;

	if (!...)
		goto err_foo;

	scheduler->flags[ring->id] |= I915_SF_INT_ENABLED;
out:
	return 0;
err_foo:
	foobar();
err_irq:
	list_remove()
	return -EINVAL;

> +	}
> +
> +	return 0;
> +}
> +
> +static uint32_t i915_scheduler_count_flying(struct i915_scheduler *scheduler,
> +					    struct intel_engine_cs *ring)
> +{
> +	struct i915_scheduler_queue_entry *node;
> +	uint32_t flying = 0;
> +
> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link)

+1 for the for_each_scheduler_node(...)

> +		if (I915_SQS_IS_FLYING(node))
> +			flying++;
> +
> +	return flying;
> +}
> +
> +static void i915_scheduler_priority_bump_clear(struct i915_scheduler *scheduler)
> +{
> +	struct i915_scheduler_queue_entry *node;
> +	int i;
> +
> +	/*
> +	 * Ensure circular dependencies don't cause problems and that a bump
> +	 * by object usage only bumps each using buffer once:
> +	 */
> +	for (i = 0; i < I915_NUM_RINGS; i++) {
> +		list_for_each_entry(node, &scheduler->node_queue[i], link)
> +			node->bumped = false;
> +	}
> +}
> +
> +static int i915_scheduler_priority_bump(struct i915_scheduler *scheduler,
> +				struct i915_scheduler_queue_entry *target,

Is there specific reason why struct i915_scheduler_queue_entry are not
referred to just as "node" but "qe" and here something else, do "node"
and "qe" have a special semantic meaning?

> +				uint32_t bump)
> +{
> +	uint32_t new_priority;
> +	int i, count;
> +
> +	if (target->priority >= scheduler->priority_level_max)
> +		return 1;

So if one node reaches maximum priority the dependencies are expected
to be maxed out already?

> +
> +	if (target->bumped)
> +		return 0;
> +
> +	new_priority = target->priority + bump;
> +	if ((new_priority <= target->priority) ||
> +	    (new_priority > scheduler->priority_level_max))
> +		target->priority = scheduler->priority_level_max;
> +	else
> +		target->priority = new_priority;
> +
> +	count = 1;
> +	target->bumped = true;
> +
> +	for (i = 0; i < target->num_deps; i++) {
> +		if (!target->dep_list[i])
> +			continue;
> +
> +		if (target->dep_list[i]->bumped)
> +			continue;
> +
> +		count += i915_scheduler_priority_bump(scheduler,
> +						      target->dep_list[i],
> +						      bump);
> +	}
> +
> +	return count;
> +}
> +
> +/*
> + * Nodes are considered valid dependencies if they are queued on any ring or
> + * if they are in flight on a different ring. In flight on the same ring is no
> + * longer interesting for non-premptive nodes as the ring serialises execution.
> + * For pre-empting nodes, all in flight dependencies are valid as they must not
> + * be jumped by the act of pre-empting.
> + *
> + * Anything that is neither queued nor flying is uninteresting.
> + */
> +static inline bool i915_scheduler_is_dependency_valid(
> +			struct i915_scheduler_queue_entry *node, uint32_t idx)
> +{
> +	struct i915_scheduler_queue_entry *dep;
> +
> +	dep = node->dep_list[idx];
> +	if (!dep)
> +		return false;
> +
> +	if (I915_SQS_IS_QUEUED(dep))
> +		return true;
> +
> +	if (I915_SQS_IS_FLYING(dep)) {
> +		if (node->params.ring != dep->params.ring)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static int i915_scheduler_pop_from_queue_locked(struct intel_engine_cs *ring,
> +				struct i915_scheduler_queue_entry **pop_node)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct i915_scheduler_queue_entry *best = NULL;
> +	struct i915_scheduler_queue_entry *node;
> +	int ret;
> +	int i;
> +	bool any_queued = false;
> +	bool has_local, has_remote, only_remote = false;
> +
> +	*pop_node = NULL;
> +	ret = -ENODATA;
> +
> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
> +		if (!I915_SQS_IS_QUEUED(node))
> +			continue;
> +		any_queued = true;
> +
> +		has_local  = false;
> +		has_remote = false;
> +		for (i = 0; i < node->num_deps; i++) {
> +			if (!i915_scheduler_is_dependency_valid(node, i))
> +				continue;
> +
> +			if (node->dep_list[i]->params.ring == node->params.ring)
> +				has_local = true;
> +			else
> +				has_remote = true;
> +		}
> +
> +		if (has_remote && !has_local)
> +			only_remote = true;
> +
> +		if (!has_local && !has_remote) {
> +			if (!best ||
> +			    (node->priority > best->priority))
> +				best = node;
> +		}
> +	}
> +
> +	if (best) {
> +		list_del(&best->link);
> +
> +		INIT_LIST_HEAD(&best->link);
> +		best->status  = i915_sqs_popped;
> +
> +		ret = 0;
> +	} else {
> +		/* Can only get here if:
> +		 * (a) there are no buffers in the queue
> +		 * (b) all queued buffers are dependent on other buffers
> +		 *     e.g. on a buffer that is in flight on a different ring
> +		 */
> +		if (only_remote) {
> +			/* The only dependent buffers are on another ring. */
> +			ret = -EAGAIN;
> +		} else if (any_queued) {
> +			/* It seems that something has gone horribly wrong! */
> +			DRM_ERROR("Broken dependency tracking on ring %d!\n",
> +				  (int) ring->id);

Would this condition be worthy WARN_ONCE(), because it will keep
repeating as the situation will persist?

> +		}
> +	}
> +
> +	*pop_node = best;
> +	return ret;
> +}
> +
> +/*
> + * NB: The driver mutex lock must be held before calling this function. It is
> + * only really required during the actual back end submission call. However,
> + * attempting to acquire a mutex while holding a spin lock is a Bad Idea.
> + * And releasing the one before acquiring the other leads to other code
> + * being run and interfering.
> + */
> +static int i915_scheduler_submit(struct intel_engine_cs *ring)
> +{
> +	struct drm_device *dev = ring->dev;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct i915_scheduler_queue_entry *node;
> +	int ret, count = 0, flying;
> +
> +	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
> +
> +	spin_lock_irq(&scheduler->lock);
> +
> +	WARN_ON(scheduler->flags[ring->id] & i915_sf_submitting);
> +	scheduler->flags[ring->id] |= i915_sf_submitting;
> +
> +	/* First time around, complain if anything unexpected occurs: */
> +	ret = i915_scheduler_pop_from_queue_locked(ring, &node);
> +	if (ret)
> +		goto error;
> +
> +	do {
> +		WARN_ON(!node);
> +		WARN_ON(node->params.ring != ring);
> +		WARN_ON(node->status != i915_sqs_popped);
> +		count++;
> +
> +		/*
> +		 * The call to pop above will have removed the node from the
> +		 * list. So add it back in and mark it as in flight.
> +		 */
> +		i915_scheduler_node_fly(node);
> +
> +		spin_unlock_irq(&scheduler->lock);
> +		ret = dev_priv->gt.execbuf_final(&node->params);
> +		spin_lock_irq(&scheduler->lock);
> +
> +		/*
> +		 * Handle failed submission but first check that the
> +		 * watchdog/reset code has not nuked the node while we
> +		 * weren't looking:
> +		 */
> +		if (ret && (node->status != i915_sqs_dead)) {
> +			bool requeue = true;
> +
> +			/*
> +			 * Oh dear! Either the node is broken or the ring is
> +			 * busy. So need to kill the node or requeue it and try
> +			 * again later as appropriate.
> +			 */
> +
> +			switch (-ret) {
> +			case ENODEV:
> +			case ENOENT:
> +				/* Fatal errors. Kill the node. */
> +				requeue = false;
> +				i915_scheduler_node_kill(node);
> +			break;

These break indents still.

> +
> +			case EAGAIN:
> +			case EBUSY:
> +			case EIO:
> +			case ENOMEM:
> +			case ERESTARTSYS:
> +			case EINTR:
> +				/* Supposedly recoverable errors. */
> +			break;
> +
> +			default:
> +				/*
> +				 * Assume the error is recoverable and hope
> +				 * for the best.
> +				 */
> +				MISSING_CASE(-ret);
> +			break;
> +			}
> +
> +			if (requeue) {
> +				i915_scheduler_node_requeue(node);
> +				/*
> +				 * No point spinning if the ring is currently
> +				 * unavailable so just give up and come back
> +				 * later.
> +				 */
> +				break;
> +			}
> +		}
> +
> +		/* Keep launching until the sky is sufficiently full. */
> +		flying = i915_scheduler_count_flying(scheduler, ring);
> +		if (flying >= scheduler->min_flying)
> +			break;
> +
> +		/* Grab another node and go round again... */
> +		ret = i915_scheduler_pop_from_queue_locked(ring, &node);
> +	} while (ret == 0);
> +
> +	/* Don't complain about not being able to submit extra entries */
> +	if (ret == -ENODATA)
> +		ret = 0;
> +
> +	/*
> +	 * Bump the priority of everything that was not submitted to prevent
> +	 * starvation of low priority tasks by a spamming high priority task.
> +	 */

This, this calls for an I-G-T test to make sure we're effective.

> +	i915_scheduler_priority_bump_clear(scheduler);
> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
> +		if (!I915_SQS_IS_QUEUED(node))
> +			continue;
> +
> +		i915_scheduler_priority_bump(scheduler, node,
> +					     scheduler->priority_level_bump);
> +	}
> +
> +	/* On success, return the number of buffers submitted. */
> +	if (ret == 0)
> +		ret = count;
> +
> +error:
> +	scheduler->flags[ring->id] &= ~i915_sf_submitting;
> +	spin_unlock_irq(&scheduler->lock);
> +	return ret;
> +}
> +
> +static void i915_generate_dependencies(struct i915_scheduler *scheduler,
> +				       struct i915_scheduler_queue_entry *node,
> +				       uint32_t ring)
> +{
> +	struct i915_scheduler_obj_entry *this, *that;
> +	struct i915_scheduler_queue_entry *test;
> +	int i, j;
> +	bool found;
> +
> +	list_for_each_entry(test, &scheduler->node_queue[ring], link) {
> +		if (I915_SQS_IS_COMPLETE(test))
> +			continue;
> +
> +		/*
> +		 * Batches on the same ring for the same
> +		 * context must be kept in order.
> +		 */
> +		found = (node->params.ctx == test->params.ctx) &&
> +			(node->params.ring == test->params.ring);
> +
> +		/*
> +		 * Batches working on the same objects must
> +		 * be kept in order.
> +		 */
> +		for (i = 0; (i < node->num_objs) && !found; i++) {
> +			this = node->saved_objects + i;

node->objs and node->num_objs would be a better naming match. num_objs
and saved_objects makes me think we have a bug here, indexing wrong
array.

> +
> +			for (j = 0; j < test->num_objs; j++) {
> +				that = test->saved_objects + j;
> +
> +				if (this->obj != that->obj)
> +					continue;
> +
> +				/* Only need to worry about writes */
> +				if (this->read_only && that->read_only)
> +					continue;
> +
> +				found = true;
> +				break;
> +			}
> +		}
> +
> +		if (found) {
> +			node->dep_list[node->num_deps] = test;
> +			node->num_deps++;
> +		}
> +	}
> +}
> +
> +static int i915_scheduler_queue_execbuffer_bypass(struct i915_scheduler_queue_entry *qe)
> +{
> +	struct drm_i915_private *dev_priv = qe->params.dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	int ret;
> +
> +	scheduler->flags[qe->params.ring->id] |= i915_sf_submitting;
> +	ret = dev_priv->gt.execbuf_final(&qe->params);
> +	scheduler->flags[qe->params.ring->id] &= ~i915_sf_submitting;
> +

The above code makes me think of locking again, maybe document at top
of this function.

> +	/*
> +	 * Don't do any clean up on failure because the caller will
> +	 * do it all anyway.
> +	 */
> +	if (ret)
> +		return ret;
> +
> +	/* Need to release any resources held by the node: */
> +	i915_scheduler_clean_node(qe);
> +
> +	return 0;
> +}
> +
> +/**
> + * i915_scheduler_queue_execbuffer - Submit a batch buffer request to the
> + * scheduler.
> + * @qe: The batch buffer request to be queued.
> + * The expectation is the qe passed in is a local stack variable. This
> + * function will copy its contents into a freshly allocated list node. The
> + * new node takes ownership of said contents so the original qe should simply
> + * be discarded and not cleaned up (i.e. don't free memory it points to or
> + * dereference objects it holds). The node is added to the scheduler's queue
> + * and the batch buffer will be submitted to the hardware at some future
> + * point in time (which may be immediately, before returning or may be quite
> + * a lot later).
> + */
> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe)
> +{
> +	struct drm_i915_private *dev_priv = qe->params.dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct intel_engine_cs *ring = qe->params.ring;
> +	struct i915_scheduler_queue_entry *node;
> +	struct i915_scheduler_queue_entry *test;
> +	bool not_flying;
> +	int i, r;
> +	int incomplete = 0;
> +
> +	WARN_ON(!scheduler);
> +
> +	if (1/*!i915.enable_scheduler*/)
> +		return i915_scheduler_queue_execbuffer_bypass(qe);
> +

I'm thinking, should we branch already before calling a function named
i915_scheduler_queue_execbuffer if scheduler is disabled?

> +	node = kmalloc(sizeof(*node), GFP_KERNEL);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	*node = *qe;

Now I read the added documentation for the function, maybe we should at
least zero out qe to avoid future problems?

> +	INIT_LIST_HEAD(&node->link);
> +	node->status = i915_sqs_queued;
> +	node->stamp  = jiffies;
> +	i915_gem_request_reference(node->params.request);
> +
> +	WARN_ON(node->params.request->scheduler_qe);
> +	node->params.request->scheduler_qe = node;
> +
> +	/* Need to determine the number of incomplete entries in the list as
> +	 * that will be the maximum size of the dependency list.
> +	 *
> +	 * Note that the allocation must not be made with the spinlock acquired
> +	 * as kmalloc can sleep. However, the unlock/relock is safe because no
> +	 * new entries can be queued up during the unlock as the i915 driver
> +	 * mutex is still held. Entries could be removed from the list but that
> +	 * just means the dep_list will be over-allocated which is fine.
> +	 */
> +	spin_lock_irq(&scheduler->lock);

-->

> +	for (r = 0; r < I915_NUM_RINGS; r++) {
> +		list_for_each_entry(test, &scheduler->node_queue[r], link) {
> +			if (I915_SQS_IS_COMPLETE(test))
> +				continue;
> +
> +			incomplete++;
> +		}
> +	}
> +

<-- This counting structure, I still think it would be good idea to
make it a static helper.

> +	/* Temporarily unlock to allocate memory: */
> +	spin_unlock_irq(&scheduler->lock);
> +	if (incomplete) {
> +		node->dep_list = kmalloc_array(incomplete,
> +					       sizeof(*node->dep_list),
> +					       GFP_KERNEL);
> +		if (!node->dep_list) {
> +			kfree(node);
> +			return -ENOMEM;
> +		}
> +	} else
> +		node->dep_list = NULL;
> +
> +	spin_lock_irq(&scheduler->lock);
> +	node->num_deps = 0;
> +
> +	if (node->dep_list) {
> +		for (r = 0; r < I915_NUM_RINGS; r++)
> +			i915_generate_dependencies(scheduler, node, r);
> +
> +		WARN_ON(node->num_deps > incomplete);
> +	}
> +
> +	node->priority = clamp(node->priority,
> +			       scheduler->priority_level_min,
> +			       scheduler->priority_level_max);
> +
> +	if ((node->priority > 0) && node->num_deps) {
> +		i915_scheduler_priority_bump_clear(scheduler);
> +
> +		for (i = 0; i < node->num_deps; i++)
> +			i915_scheduler_priority_bump(scheduler,
> +					node->dep_list[i], node->priority);

If node is posted with maximum priority to begin with, wouldn't the
priority propagation stop after first level dependencies due to the
check in the beginning of priority_bump function?

> +
> +	list_add_tail(&node->link, &scheduler->node_queue[ring->id]);
> +
> +	not_flying = i915_scheduler_count_flying(scheduler, ring) <
> +						 scheduler->min_flying;
> +
> +	spin_unlock_irq(&scheduler->lock);
> +
> +	if (not_flying)
> +		i915_scheduler_submit(ring);
> +
> +	return 0;
> +}
> +
> +/**
> + * i915_scheduler_notify_request - Notify the scheduler that the given
> + * request has completed on the hardware.
> + * @req: Request structure which has completed
> + * @preempt: Did it complete pre-emptively?
> + * A sequence number has popped out of the hardware and the request handling
> + * code has mapped it back to a request and will mark that request complete.
> + * It also calls this function to notify the scheduler about the completion
> + * so the scheduler's node can be updated appropriately.
> + * Returns true if the request is scheduler managed, false if not. The return
> + * value is combined for all freshly completed requests and if any were true
> + * then i915_scheduler_wakeup() is called so the scheduler can do further
> + * processing (submit more work) at the end.
> + */
> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req)
> +{
> +	struct drm_i915_private *dev_priv = to_i915(req->ring->dev);
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct i915_scheduler_queue_entry *node = req->scheduler_qe;
> +	unsigned long flags;
> +
> +	if (!node)
> +		return false;
> +
> +	spin_lock_irqsave(&scheduler->lock, flags);
> +
> +	WARN_ON(!I915_SQS_IS_FLYING(node));
> +
> +	/* Node was in flight so mark it as complete. */
> +	if (req->cancelled)
> +		node->status = i915_sqs_dead;
> +	else
> +		node->status = i915_sqs_complete;
> +
> +	spin_unlock_irqrestore(&scheduler->lock, flags);
> +
> +	return true;
> +}
> +
> +static int i915_scheduler_remove_dependent(struct i915_scheduler *scheduler,
> +				struct i915_scheduler_queue_entry *remove)
> +{
> +	struct i915_scheduler_queue_entry *node;
> +	int i, r;
> +	int count = 0;
> +
> +	/*
> +	 * Ensure that a node is not being removed which is still dependent
> +	 * upon other (not completed) work. If that happens, it implies
> +	 * something has gone very wrong with the dependency tracking! Note
> +	 * that there is no need to worry if this node has been explicitly
> +	 * killed for some reason - it might be being killed before it got
> +	 * sent to the hardware.
> +	 */
> +	if (remove->status != i915_sqs_dead) {
> +		for (i = 0; i < remove->num_deps; i++)
> +			if ((remove->dep_list[i]) &&
> +			    (!I915_SQS_IS_COMPLETE(remove->dep_list[i])))
> +				count++;
> +		WARN_ON(count);
> +	}
> +
> +	/*
> +	 * Remove this node from the dependency lists of any other node which
> +	 * might be waiting on it.
> +	 */
> +	for (r = 0; r < I915_NUM_RINGS; r++) {
> +		list_for_each_entry(node, &scheduler->node_queue[r], link) {
> +			for (i = 0; i < node->num_deps; i++) {
> +				if (node->dep_list[i] != remove)
> +					continue;
> +
> +				node->dep_list[i] = NULL;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * i915_scheduler_wakeup - wake the scheduler's worker thread
> + * @dev: DRM device
> + * Called at the end of seqno interrupt processing if any request has
> + * completed that corresponds to a scheduler node.
> + */
> +void i915_scheduler_wakeup(struct drm_device *dev)
> +{
> +	/* XXX: Need to call i915_scheduler_remove() via work handler. */
> +}
> +
> +/**
> + * i915_scheduler_clean_node - free up any allocations/references
> + * associated with the given scheduler queue entry.
> + * @node: Queue entry structure which is complete
> + * After a give batch buffer completes on the hardware, all the information
> + * required to resubmit it is no longer required. However, the node entry
> + * itself might still be required for tracking purposes for a while longer.
> + * This function should be called as soon as the node is known to be complete
> + * so that these resources may be freed even though the node itself might
> + * hang around.
> + */
> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node)
> +{
> +	if (!I915_SQS_IS_COMPLETE(node)) {
> +		WARN(!node->params.request->cancelled,
> +		     "Cleaning active node: %d!\n", node->status);
> +		return;
> +	}
> +
> +	if (node->params.batch_obj) {
> +		/*
> +		 * The batch buffer must be unpinned before it is unreferenced
> +		 * otherwise the unpin fails with a missing vma!?
> +		 */
> +		if (node->params.dispatch_flags & I915_DISPATCH_SECURE)
> +			i915_gem_execbuff_release_batch_obj(node->params.batch_obj);
> +

This code seems little bit out of place.

> +		node->params.batch_obj = NULL;
> +	}
> +
> +	/* And anything else owned by the node: */
> +	if (node->params.cliprects) {
> +		kfree(node->params.cliprects);
> +		node->params.cliprects = NULL;
> +	}
> +}
> +

The below function is quite hacky, I understood this will be mitigated
by per-ctx sequence numbering in the future?

> +static bool i915_scheduler_remove(struct i915_scheduler *scheduler,
> +				  struct intel_engine_cs *ring,
> +				  struct list_head *remove)
> +{
> +	struct i915_scheduler_queue_entry *node, *node_next;
> +	int flying = 0, queued = 0;
> +	bool do_submit;
> +	uint32_t min_seqno;
> +
> +	spin_lock_irq(&scheduler->lock);
> +
> +	/*
> +	 * In the case where the system is idle, starting 'min_seqno' from a big
> +	 * number will cause all nodes to be removed as they are now back to
> +	 * being in-order. However, this will be a problem if the last one to
> +	 * complete was actually out-of-order as the ring seqno value will be
> +	 * lower than one or more completed buffers. Thus code looking for the
> +	 * completion of said buffers will wait forever.
> +	 * Instead, use the hardware seqno as the starting point. This means
> +	 * that some buffers might be kept around even in a completely idle
> +	 * system but it should guarantee that no-one ever gets confused when
> +	 * waiting for buffer completion.
> +	 */
> +	min_seqno = ring->get_seqno(ring, true);
> +
> +	list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
> +		if (I915_SQS_IS_QUEUED(node))
> +			queued++;
> +		else if (I915_SQS_IS_FLYING(node))
> +			flying++;
> +		else if (I915_SQS_IS_COMPLETE(node))
> +			continue;
> +
> +		if (node->params.request->seqno == 0)
> +			continue;
> +
> +		if (!i915_seqno_passed(node->params.request->seqno, min_seqno))
> +			min_seqno = node->params.request->seqno;
> +	}
> +
> +	INIT_LIST_HEAD(remove);
> +	list_for_each_entry_safe(node, node_next, &scheduler->node_queue[ring->id], link) {
> +		/*
> +		 * Only remove completed nodes which have a lower seqno than
> +		 * all pending nodes. While there is the possibility of the
> +		 * ring's seqno counting backwards, all higher buffers must
> +		 * be remembered so that the 'i915_seqno_passed()' test can
> +		 * report that they have in fact passed.
> +		 *
> +		 * NB: This is not true for 'dead' nodes. The GPU reset causes
> +		 * the software seqno to restart from its initial value. Thus
> +		 * the dead nodes must be removed even though their seqno values
> +		 * are potentially vastly greater than the current ring seqno.
> +		 */
> +		if (!I915_SQS_IS_COMPLETE(node))
> +			continue;
> +
> +		if (node->status != i915_sqs_dead) {
> +			if (i915_seqno_passed(node->params.request->seqno, min_seqno) &&
> +			    (node->params.request->seqno != min_seqno))
> +				continue;
> +		}
> +
> +		list_del(&node->link);
> +		list_add(&node->link, remove);
> +
> +		/* Strip the dependency info while the mutex is still locked */
> +		i915_scheduler_remove_dependent(scheduler, node);
> +
> +		continue;
> +	}
> +
> +	/*
> +	 * Release the interrupt reference count if there are no longer any
> +	 * nodes to worry about.
> +	 */
> +	if (!flying && !queued &&
> +	    (scheduler->flags[ring->id] & i915_sf_interrupts_enabled)) {
> +		ring->irq_put(ring);
> +		scheduler->flags[ring->id] &= ~i915_sf_interrupts_enabled;

Maybe have this piece of code as a helper? To enable interrupts and
disable them, maybe even a reference count?

> +	}
> +
> +	/* Launch more packets now? */
> +	do_submit = (queued > 0) && (flying < scheduler->min_flying);
> +
> +	spin_unlock_irq(&scheduler->lock);
> +
> +	return do_submit;
> +}
> +
> +void i915_scheduler_process_work(struct intel_engine_cs *ring)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct i915_scheduler *scheduler = dev_priv->scheduler;
> +	struct i915_scheduler_queue_entry *node;
> +	bool do_submit;
> +	struct list_head remove;
> +
> +	if (list_empty(&scheduler->node_queue[ring->id]))
> +		return;
> +
> +	/* Remove completed nodes. */
> +	do_submit = i915_scheduler_remove(scheduler, ring, &remove);
> +
> +	if (!do_submit && list_empty(&remove))
> +		return;
> +
> +	/* Need to grab the pm lock outside of the mutex lock */
> +	if (do_submit)
> +		intel_runtime_pm_get(dev_priv);
> +
> +	mutex_lock(&ring->dev->struct_mutex);
> +
> +	if (do_submit)
> +		i915_scheduler_submit(ring);
> +
> +	while (!list_empty(&remove)) {
> +		node = list_first_entry(&remove, typeof(*node), link);
> +		list_del(&node->link);
> +
> +		/* Free up all the DRM references */
> +		i915_scheduler_clean_node(node);
> +
> +		/* And anything else owned by the node: */
> +		node->params.request->scheduler_qe = NULL;
> +		i915_gem_request_unreference(node->params.request);
> +		kfree(node->dep_list);
> +		kfree(node);

Maybe have a separate helper for freeing a node, would make this much
cleaner.

> +	}
> +
> +	mutex_unlock(&ring->dev->struct_mutex);
> +
> +	if (do_submit)
> +		intel_runtime_pm_put(dev_priv);
> +}
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
> new file mode 100644
> index 0000000..415fec8
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/i915_scheduler.h
> @@ -0,0 +1,95 @@
> +/*
> + * Copyright (c) 2014 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef _I915_SCHEDULER_H_
> +#define _I915_SCHEDULER_H_
> +
> +enum i915_scheduler_queue_status {
> +	/* Limbo: */
> +	i915_sqs_none = 0,
> +	/* Not yet submitted to hardware: */
> +	i915_sqs_queued,
> +	/* Popped from queue, ready to fly: */
> +	i915_sqs_popped,
> +	/* Sent to hardware for processing: */
> +	i915_sqs_flying,
> +	/* Finished processing on the hardware: */
> +	i915_sqs_complete,
> +	/* Killed by watchdog or catastrophic submission failure: */
> +	i915_sqs_dead,
> +	/* Limit value for use with arrays/loops */
> +	i915_sqs_MAX
> +};
> +

To uppercase.

> +#define I915_SQS_IS_QUEUED(node)	(((node)->status == i915_sqs_queued))
> +#define I915_SQS_IS_FLYING(node)	(((node)->status == i915_sqs_flying))
> +#define I915_SQS_IS_COMPLETE(node)	(((node)->status == i915_sqs_complete) || \
> +					 ((node)->status == i915_sqs_dead))
> +
> +struct i915_scheduler_obj_entry {
> +	struct drm_i915_gem_object          *obj;
> +	bool                                read_only;
> +};
> +
> +struct i915_scheduler_queue_entry {
> +	struct i915_execbuffer_params       params;
> +	/* -1023 = lowest priority, 0 = default, 1023 = highest */
> +	int32_t                             priority;
> +	struct i915_scheduler_obj_entry     *saved_objects;
> +	int                                 num_objs;
> +	bool                                bumped;
> +	struct i915_scheduler_queue_entry   **dep_list;
> +	int                                 num_deps;
> +	enum i915_scheduler_queue_status    status;
> +	unsigned long                       stamp;
> +	struct list_head                    link;

Above whitespace is still incorrect and I could really use comments.

> +};
> +
> +struct i915_scheduler {
> +	struct list_head    node_queue[I915_NUM_RINGS];
> +	uint32_t            flags[I915_NUM_RINGS];
> +	spinlock_t          lock;
> +
> +	/* Tuning parameters: */
> +	int32_t             priority_level_min;
> +	int32_t             priority_level_max;
> +	int32_t             priority_level_bump;
> +	int32_t             priority_level_preempt;
> +	uint32_t            min_flying;
> +};
> +
> +/* Flag bits for i915_scheduler::flags */
> +enum {
> +	i915_sf_interrupts_enabled  = (1 << 0),
> +	i915_sf_submitting          = (1 << 1),
> +};

These must be I915_UPPERCASE_SOMETHING by convention.

> +
> +bool i915_scheduler_is_enabled(struct drm_device *dev);
> +int i915_scheduler_init(struct drm_device *dev);
> +void i915_scheduler_clean_node(struct i915_scheduler_queue_entry *node);
> +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe);
> +bool i915_scheduler_notify_request(struct drm_i915_gem_request *req);
> +void i915_scheduler_wakeup(struct drm_device *dev);
> +
> +#endif  /* _I915_SCHEDULER_H_ */
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation