[Intel-gfx] [PATCH v4 07/38] drm/i915: Start of GPU scheduler
Joonas Lahtinen
joonas.lahtinen at linux.intel.com
Fri Feb 19 10:13:43 UTC 2016
Hi,
Adding Daniel as CC to comment below.
On to, 2016-02-18 at 14:22 +0000, John Harrison wrote:
> On 20/01/2016 13:18, Joonas Lahtinen wrote:
> > On Mon, 2016-01-11 at 18:42 +0000, John.C.Harrison at Intel.com wrote:
> > > From: John Harrison <John.C.Harrison at Intel.com>
> > >
<SNIP>
> > >
> > > + this = node->saved_objects + i;
> > > +
> > > + for (j = 0; j < test->num_objs; j++) {
> > > + that = test->saved_objects + j;
> > > +
> > > + if (this->obj != that->obj)
> > > + continue;
> > How about VMAs? There might be multiple mappings to an object; wouldn't
> > it be enough to depend on the required VMA instead of the whole object?
> The object is what we get coming in from user land through the IOCTL. So
> why make things more complicated? If there are multiple VMAs referring
> to the same object then we can't just track an individual VMA as that
> would lose the dependency on all the other VMAs. Just because the
> object is mapped to someone else's address space doesn't mean that this
> batch buffer can't overwrite data they are reading.
>
Right, makes sense.
> > > +int i915_scheduler_queue_execbuffer(struct i915_scheduler_queue_entry *qe)
> > > +{
> > > + struct drm_i915_private *dev_priv = qe->params.dev->dev_private;
> > > + struct i915_scheduler *scheduler = dev_priv->scheduler;
> > > + struct intel_engine_cs *ring = qe->params.ring;
> > > + struct i915_scheduler_queue_entry *node;
> > > + struct i915_scheduler_queue_entry *test;
> > > + unsigned long flags;
> > > + bool not_flying;
> > > + int i, r;
> > > + int incomplete = 0;
> > > +
> > > + WARN_ON(!scheduler);
> > > +
> > This kind of situation should be a BUG_ON, because scheduler
> > being NULL is literally going to cause an OOPS in the next dereference,
> > which is going to happen unconditionally. A WARN followed by an OOPS is
> > exactly what BUG_ON should be used to avoid. But this should be removed
> > anyway after scheduler is made a data member of dev_priv.
>
> The WARNs were originally BUGs but Daniel Vetter had the opposite
> demand. His view was that the driver should never BUG under any
> circumstances. A WARN followed by an oops is better than a BUG because
> maybe it won't actually oops.
>
WARN_ON is better than BUG_ON when there won't be an immediate OOPS.
But if you're doing a NULL pointer dereference, like here when scheduler
is NULL, it is 100% certain to either OOPS or cause horrible
undefined behaviour (on a non-x86 platform).
Something like:
	if (WARN_ON(!a))
		return -ENODEV;
could be what Daniel meant (if the error is propagated up and somehow
dealt with), but:
	WARN_ON(!a);
	a->foo = bar;
should simply be BUG_ON(!a), because otherwise it'll just end up being
a WARNING + OOPS right after each other.
>
> >
> > > + if (1/*i915.scheduler_override & i915_so_direct_submit*/) {
> > I assume this is going to be addressed in a future commit. Could have
> > been introduced in this patch, too.
> >
> > > + int ret;
> > > +
> > > + scheduler->flags[qe->params.ring->id] |= i915_sf_submitting;
> > > + ret = dev_priv->gt.execbuf_final(&qe->params);
> > > + scheduler->flags[qe->params.ring->id] &= ~i915_sf_submitting;
> > > +
> > The kerneldoc should mention the locking requirements of this function.
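To be explicit about what I mean, just a line or two in the kerneldoc, e.g.
(the actual requirement needs to be filled in, I'm only guessing here):
	/**
	 * i915_scheduler_queue_execbuffer - queue a batch buffer for submission
	 * @qe: queue entry describing the batch; caller retains ownership
	 *
	 * Must be called with struct_mutex held (adjust to whatever actually
	 * applies, e.g. the scheduler spinlock).
	 */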
> >
> > > + /*
> > > + * Don't do any clean up on failure because the caller will
> > > + * do it all anyway.
> > > + */
> > > + if (ret)
> > > + return ret;
> > > +
> > > + /* Free everything that is owned by the QE structure: */
> > > + kfree(qe->params.cliprects);
> > > + if (qe->params.dispatch_flags & I915_DISPATCH_SECURE)
> > > + i915_gem_execbuff_release_batch_obj(qe->params.batch_obj);
> > > +
> > > + return 0;
> > The above piece of code looks like its own function, so it should
> > probably be one.
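To be a bit more concrete, something along these lines (the helper name is
purely an example, not from the patch):
	static void i915_scheduler_release_qe_contents(struct i915_scheduler_queue_entry *qe)
	{
		/* Free everything that is owned by the QE structure. */
		kfree(qe->params.cliprects);
		if (qe->params.dispatch_flags & I915_DISPATCH_SECURE)
			i915_gem_execbuff_release_batch_obj(qe->params.batch_obj);
	}
The direct submission path would then just call that before returning 0.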
> >
> > > + }
> > > +
> > > + node = kmalloc(sizeof(*node), GFP_KERNEL);
> > > + if (!node)
> > > + return -ENOMEM;
> > > +
> > > + *node = *qe;
> > Any reason we can't simply move ownership of qe? If we can't, I'd
> > rather make a clone function.
>
> The qe pointer passed in is a reference to a stack local object in the
> execbuff code path. Thus ownership cannot be transferred. Doing it this
> way keeps the execbuff code nice and simple, and all the dynamic memory
> management and list tracking is self-contained within the scheduler.
>
I would indicate this with a const qualifier on the function parameter.
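Roughly (just a sketch; I have not checked whether the rest of the call
chain, execbuf_final() and friends, is const-clean):
	int i915_scheduler_queue_execbuffer(const struct i915_scheduler_queue_entry *qe);
That way the "caller keeps ownership, the scheduler only takes a copy"
contract is visible right in the prototype.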
> > > +
> > > +static int i915_scheduler_fly_node(struct i915_scheduler_queue_entry *node)
> > > +{
> > > + struct drm_i915_private *dev_priv = node->params.dev->dev_private;
> > > + struct i915_scheduler *scheduler = dev_priv->scheduler;
> > > + struct intel_engine_cs *ring;
> > > +
> > > + WARN_ON(!scheduler);
> > > + WARN_ON(!node);
> > > + WARN_ON(node->status != i915_sqs_popped);
> > Other states had their I915_SQS_IS_* macro; why don't some?
> The purpose of the macro is to allow the combining of individual states
> into classes. E.g. dead and complete can both be considered complete for
> the majority of cases. Only in certain situations do you need to know
> that it really was dead. Hence most places that don't really care just
> use the merged macros, whereas places like this that do care use the
> explicit enum value.
>
Right, just asking because having a macro might increase consistency.
Also here we have WARN_ON(!node) and then the next line dereferences
node, which again is surely better off as BUG_ON.
Although it's not unreasonable to expect a function to OOPS if you pass
it a NULL pointer. I think only the higher calling level should have an
if (WARN_ON(!node)) return ... or a BUG_ON construct before calling this
function.
Only calls coming from userland need to be treated with such care that
anything can be passed; internal functions don't. We don't have
BUG_ON(!dev) or BUG_ON(!dev_priv) around either.
> > > + /*
> > > + * In the case where the system is idle, starting 'min_seqno' from a big
> > > + * number will cause all nodes to be removed as they are now back to
> > > + * being in-order. However, this will be a problem if the last one to
> > > + * complete was actually out-of-order as the ring seqno value will be
> > > + * lower than one or more completed buffers. Thus code looking for the
> > > + * completion of said buffers will wait forever.
> > > + * Instead, use the hardware seqno as the starting point. This means
> > > + * that some buffers might be kept around even in a completely idle
> > > + * system but it should guarantee that no-one ever gets confused when
> > > + * waiting for buffer completion.
> > > + */
> > > + min_seqno = ring->get_seqno(ring, true);
> > > +
> > > + list_for_each_entry(node, &scheduler->node_queue[ring->id], link) {
> > > + if (I915_SQS_IS_QUEUED(node))
> > > + queued++;
> > > + else if (I915_SQS_IS_FLYING(node))
> > > + flying++;
> > > + else if (I915_SQS_IS_COMPLETE(node))
> > > + continue;
> > > +
> > > + if (node->params.request->seqno == 0)
> > > + continue;
> > > +
> > > + if (!i915_seqno_passed(node->params.request->seqno, min_seqno))
> > > + min_seqno = node->params.request->seqno;
> > > + }
> > Couldn't these values be kept cached, instead of counting them on each
> > call?
> The 'queued' and 'flying' totals could be kept cached, but min_seqno is
> dependent upon the state of the hardware so needs to be recalculated. Given
> that, calculating the totals here is trivial and avoids having
> extra code elsewhere to keep them up to date.
>
Ok.
Btw, for_each_scheduler_node or such would be a great macro. Those are
very much preferred for readability.
$ fgrep for_each_ drivers/gpu/drm/i915/* | wc -l
525
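Something along these lines would do (just a sketch, name and exact
placement up to you):
	#define for_each_scheduler_node(node, scheduler, id) \
		list_for_each_entry((node), &(scheduler)->node_queue[(id)], link)
Then the loops in this patch become for_each_scheduler_node(node, scheduler,
ring->id) and the intent is obvious at a glance.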
> >
> > > +
> > > + INIT_LIST_HEAD(&remove);
> > > + list_for_each_entry_safe(node, node_next, &scheduler->node_queue[ring->id], link) {
> > > + /*
> > > + * Only remove completed nodes which have a lower seqno than
> > > + * all pending nodes. While there is the possibility of the
> > > + * ring's seqno counting backwards, all higher buffers must
> > > + * be remembered so that the 'i915_seqno_passed()' test can
> > > + * report that they have in fact passed.
> > > + *
> > > + * NB: This is not true for 'dead' nodes. The GPU reset causes
> > > + * the software seqno to restart from its initial value. Thus
> > > + * the dead nodes must be removed even though their seqno values
> > > + * are potentially vastly greater than the current ring seqno.
> > > + */
> > > + if (!I915_SQS_IS_COMPLETE(node))
> > > + continue;
> > > +
> > > + if (node->status != i915_sqs_dead) {
> > > + if (i915_seqno_passed(node->params.request->seqno, min_seqno) &&
> > > + (node->params.request->seqno != min_seqno))
> > > + continue;
> > > + }
> > > +
> > > + list_del(&node->link);
> > > + list_add(&node->link, &remove);
> > > +
> > > + /* Strip the dependency info while the mutex is still locked */
> > > + i915_scheduler_remove_dependent(scheduler, node);
> > > +
> > > + continue;
> > > + }
<SNIP>
> > > + do {
> > > + WARN_ON(!node);
> > > + WARN_ON(node->params.ring != ring);
> > > + WARN_ON(node->status != i915_sqs_popped);
> > > + count++;
> > > +
> > > + /*
> > > + * The call to pop above will have removed the node from the
> > > + * list. So add it back in and mark it as in flight.
> > > + */
> > > + i915_scheduler_fly_node(node);
> > Why do we want to pull an object out of the list inside the spinlock and
> > push it back immediately in our critical code path? Seems like a waste
> > for no obvious gain at this point. Why don't we rather just select an
> > entry and modify it in place, if it's going to stay in the same queue
> > anyway?
> The list order is significant. The element must be moved to the front to
> keep the submitted items in submission order. Doing it this way also
> keeps the code nicely partitioned and easier to understand/maintain.
> Plus, there is a plan to optimise the code by splitting the one single
> list into three separate ones - queued, flying, complete. If/when that
> happens, the element will have to be removed from one list and added to
> another.
Ok.
>
> >
> > > +
> > > + scheduler->flags[ring->id] |= i915_sf_submitting;
> > > + spin_unlock_irqrestore(&scheduler->lock, flags);
> > > + ret = dev_priv->gt.execbuf_final(&node->params);
> > > + spin_lock_irqsave(&scheduler->lock, flags);
> > > + scheduler->flags[ring->id] &= ~i915_sf_submitting;
> > > +
> > > + if (ret) {
> > > + int requeue = 1;
> > Multipurpose variable, not really a good idea. And as commented
> > further down, it should not exist at all.
> >
> > > +
> > > + /*
> > > + * Oh dear! Either the node is broken or the ring is
> > > + * busy. So need to kill the node or requeue it and try
> > > + * again later as appropriate.
> > > + */
> > > +
> > > + switch (-ret) {
> > > + case ENODEV:
> > > + case ENOENT:
> > > + /* Fatal errors. Kill the node. */
> > > + requeue = -1;
> > > + break;
> > "break" indent is wrong.
> >
> > > +
> > > + case EAGAIN:
> > > + case EBUSY:
> > > + case EIO:
> > > + case ENOMEM:
> > > + case ERESTARTSYS:
> > > + case EINTR:
> > > + /* Supposedly recoverable errors. */
> > > + break;
> > > +
> > > + default:
> > > + /*
> > > + * Assume the error is recoverable and hope
> > > + * for the best.
> > > + */
> > > + DRM_DEBUG_DRIVER("<%s> Got unexpected error from execfinal(): %d!\n",
> > > + ring->name, ret);
> > There's a MISSING_CASE macro; it should be used here.
> >
> > > + break;
> > > + }
> > > +
> > Just move the code below this point into the switch; there's no point
> > having a switch to categorize your options and then doing a bunch of
> > ifs to execute code that could be in the switch.
> One of the 'if' paths is to break out of the while loop. Can't do that
> from inside the switch.
I do think the code could still be simplified, even if it involved a
carefully placed goto.
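Something roughly like this is what I have in mind (the labels are
hypothetical, and it also picks up the MISSING_CASE suggestion from above):
	switch (-ret) {
	case ENODEV:
	case ENOENT:
		/* Fatal errors. Kill the node. */
		goto kill_node;

	case EAGAIN:
	case EBUSY:
	case EIO:
	case ENOMEM:
	case ERESTARTSYS:
	case EINTR:
		/* Supposedly recoverable errors, try again later. */
		goto requeue_node;

	default:
		/* Assume the error is recoverable and hope for the best. */
		MISSING_CASE(-ret);
		goto requeue_node;
	}
The dead-node check and the break out of the submission loop can then live
under the requeue_node label.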
>
> > > + /*
> > > + * Check that the watchdog/reset code has not nuked
> > > + * the node while we weren't looking:
> > > + */
> > > + if (node->status == i915_sqs_dead)
> > > + requeue = 0;
Btw, I just noticed: should there be some locking of the node here? Is it
handled gracefully if the status is changed right after this check, or do
we have a race condition?
>
--
Joonas Lahtinen
Open Source Technology Center
Intel Corporation