[Intel-gfx] [PATCH 08/15] drm/i915: Slaughter the thundering i915_wait_request herd

Chris Wilson chris at chris-wilson.co.uk
Mon Nov 30 04:30:29 PST 2015


On Mon, Nov 30, 2015 at 12:05:50PM +0000, Tvrtko Ursulin wrote:
> >+	struct intel_breadcrumbs {
> >+		spinlock_t lock; /* protects the per-engine request list */
> 
> s/list/tree/

list. We use the tree as a list!

> >+		struct task_struct *task; /* bh for all user interrupts */
> >+		struct intel_breadcrumbs_engine {
> >+			struct rb_root requests; /* sorted by retirement */
> >+			struct drm_i915_gem_request *first;
> 
> /* First request in the tree to be completed next by the hw. */
> 
> ?

/* first */ ?
 
> >+		} engine[I915_NUM_RINGS];
> >+	} breadcrumbs;
> >+
> >  	struct i915_virtual_gpu vgpu;
> >
> >  	struct intel_guc guc;
> >@@ -2228,6 +2250,11 @@ struct drm_i915_gem_request {
> >  	/** global list entry for this request */
> >  	struct list_head list;
> >
> >+	/** List of clients waiting for completion of this request */
> >+	wait_queue_head_t wait;
> >+	struct rb_node irq_node;
> >+	unsigned irq_count;
> 
> To me irq prefix does not fit here that well.
> 
> req_tree_node?
> 
> waiter_count, waiters?

waiters, breadcrumb_node, waiters_count ?

> >  	for (;;) {
> >-		struct timer_list timer;
> >-
> >-		prepare_to_wait(&ring->irq_queue, &wait, state);
> >+		prepare_to_wait(&req->wait, &wait, state);
> >
> >-		/* We need to check whether any gpu reset happened in between
> >-		 * the caller grabbing the seqno and now ... */
> >-		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> >-			/* As we do not requeue the request over a GPU reset,
> >-			 * if one does occur we know that the request is
> >-			 * effectively complete.
> >-			 */
> >-			ret = 0;
> >-			break;
> >-		}
> >-
> >-		if (i915_gem_request_completed(req, false)) {
> >+		if (i915_gem_request_completed(req, true) ||
> 
> Why it is OK to change the lazy coherency mode here?

We bang on the coherency after the IRQ. Here, we are woken by the IRQ
bottom-half (the breadcrumbs task), so we know that the IRQ/seqno
coherency is fine and we can just check the CPU cache.
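
To spell it out, the two call sites look roughly like this (a sketch,
not the exact patch; the second argument of i915_gem_request_completed()
selects the lazy-coherency read):

	/* Bottom-half (the breadcrumbs task), woken directly by the user
	 * interrupt: it must force the IRQ/seqno coherency before trusting
	 * the seqno it samples.
	 */
	if (i915_gem_request_completed(request, false))
		wake_up_all(&request->wait);

	/* Client waiter: only ever woken by the bottom-half above, which
	 * has already paid for coherency, so the cached CPU read suffices.
	 */
	if (i915_gem_request_completed(req, true))
		break;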
 
> >+		    req->reset_counter != i915_reset_counter(&req->i915->gpu_error)) {
> 
> It looks like it would be valuable to preserve the comment about no
> re-queuing etc on GPU reset.

I can drop it from the previous patch :)

> >+		if (timeout) {
> >+			timeout_remain = io_schedule_timeout(timeout_remain);
> >+			if (timeout_remain == 0) {
> >+				ret = -ETIME;
> >+				break;
> >+			}
> >+		} else
> >+			io_schedule();
> 
> Could set timeout_remain to that MAX value when timeout == NULL and
> have a single call site to io_schedule_timeout and less indentation.
> It doesn't matter hugely, maybe it would be a bit easier to read.

Done. io_schedule() calls io_schedule_timeout(), whereas
schedule_timeout() calls schedule()!
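
So collapsing to a single call site costs nothing (a sketch, relying on
the MAX_SCHEDULE_TIMEOUT convention):

	unsigned long timeout_remain = MAX_SCHEDULE_TIMEOUT;

	if (timeout)
		timeout_remain = nsecs_to_jiffies(*timeout);

	for (;;) {
		prepare_to_wait(&req->wait, &wait, state);

		/* ... completion / reset checks ... */

		/* Single call site: io_schedule_timeout(MAX_SCHEDULE_TIMEOUT)
		 * degenerates to plain io_schedule(), so the untimed wait is
		 * unchanged.
		 */
		timeout_remain = io_schedule_timeout(timeout_remain);
		if (timeout_remain == 0) {
			ret = -ETIME;
			break;
		}
	}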

> >@@ -900,7 +901,7 @@ static void i915_record_ring_state(struct drm_device *dev,
> >  		ering->instdone = I915_READ(GEN2_INSTDONE);
> >  	}
> >
> >-	ering->waiting = waitqueue_active(&ring->irq_queue);
> >+	ering->waiting = READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first);
> 
> What does the READ_ONCE do here?

This element is protected by the breadcrumbs spinlock; the READ_ONCE is
documentation that we deliberately do not take the lock here.
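
i.e. a deliberately unserialised peek (sketch):

	/* Under b->lock this would be an ordinary load. From the error
	 * capture path we only want a yes/no "was anybody waiting?", so
	 * take a racy snapshot; READ_ONCE() flags the missing lock to the
	 * reader and stops the compiler re-reading the pointer.
	 */
	ering->waiting = READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first);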

> >  static void notify_ring(struct intel_engine_cs *ring)
> >  {
> >-	if (!intel_ring_initialized(ring))
> >+	if (ring->i915 == NULL)
> >  		return;
> >
> >  	trace_i915_gem_request_notify(ring);
> >-
> >-	wake_up_all(&ring->irq_queue);
> >+	wake_up_process(ring->i915->breadcrumbs.task);
> 
> For a later extension:
> 
> How would you view, some time in the future, adding a bool return to
> ring->put_irq() which would say to the thread that it failed to
> disable interrupts?
> 
> In this case thread would exit and would need to be restarted when
> the next waiter queues up. Possibly extending the
> idle->put_irq->exit durations as well then.

Honestly, I am not keen on the idea that the low-level put_irq can fail
(it makes cleanup much more fraught). I don't see what improvement you
intend here; clarifying that would help explain what you mean.

> >@@ -2986,16 +2981,16 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  			if (ring_idle(ring, seqno)) {
> >  				ring->hangcheck.action = HANGCHECK_IDLE;
> >
> >-				if (waitqueue_active(&ring->irq_queue)) {
> >+				if (READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first)) {
> 
> READ_ONCE is again strange, but other than that I don't know the
> hangcheck code to really evaulate it.
> 
> Perhaps this also means this line needs a comment describing what
> condition/state is actually approximating with the check.

All we are doing is asking whether anyone is still waiting on the GPU
even though the GPU has finished all of its work. If so, the IRQ
handling is suspect (modulo the pre-existing race condition clearly
earmarked by the READ_ONCE).
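
Something along these lines (a sketch; poking the bottom-half is a
stand-in for whatever recovery hangcheck settles on):

	if (ring_idle(ring, seqno)) {
		ring->hangcheck.action = HANGCHECK_IDLE;

		/* The GPU has run out of work, yet somebody is still
		 * waiting: either the interrupt never arrived or the
		 * wakeup went missing, so kick the bottom-half to recheck.
		 */
		if (READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first))
			wake_up_process(dev_priv->breadcrumbs.task);
	}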

> >+	while (!kthread_should_stop()) {
> >+		unsigned reset_counter = i915_reset_counter(&i915->gpu_error);
> >+		unsigned long timeout_jiffies;
> >+		bool idling = false;
> >+
> >+		/* On every tick, walk the seqno-ordered list of requests
> >+		 * and for each retired request wakeup its clients. If we
> >+		 * find an unfinished request, go back to sleep.
> >+		 */
> 
> s/tick/loop/ ?

s/tick/iteration/? I'm sticking with tick.
 
> And if we don't find an unfinished request still go back to sleep. :)

Ok.
 
> >+		set_current_state(TASK_INTERRUPTIBLE);
> >+
> >+		/* Note carefully that we do not hold a reference for the
> >+		 * requests on this list. Therefore we only inspect them
> >+		 * whilst holding the spinlock to ensure that they are not
> >+		 * freed in the meantime, and the client must remove the
> >+		 * request from the list if it is interrupted (before it
> >+		 * itself releases its reference).
> >+		 */
> >+		spin_lock(&b->lock);
> 
> This lock is more per engine that global in its nature unless you
> think it was more efficient to do fewer lock atomics vs potential
> overlap of waiter activity per engines?

Indeed. I decided not to care about making it per-engine.
 
> Would need a new lock for req->irq_count, per request. So
> req->lock/req->irq_count and be->lock for the tree operations.

Why? A request is tied to an individual engine, therefore we can use
that engine's intel_breadcrumbs_engine.lock for the request.
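
E.g. (a sketch using the waiters_count spelling floated above;
breadcrumbs_insert() is a hypothetical helper for the rb-tree
insertion):

	struct intel_breadcrumbs_engine *be =
		&request->i915->breadcrumbs.engine[request->ring->id];

	/* The request lives and dies on a single engine, so be->lock can
	 * guard both the engine's rb-tree and request->waiters_count; no
	 * separate per-request lock is needed.
	 */
	spin_lock(&be->lock);
	if (request->waiters_count++ == 0)
		breadcrumbs_insert(be, request); /* hypothetical helper */
	spin_unlock(&be->lock);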
 
> Thread would only need to deal with the later. Feels like it would
> work, just not sure if it is worth it.

That was my feeling as well. Not sure if we will have huge contention
that would be solved with per-engine spinlocks, whilst we still have
struct_mutex.

> >+		for (i = 0; i < I915_NUM_RINGS; i++) {
> >+			struct intel_engine_cs *engine = &i915->ring[i];
> >+			struct intel_breadcrumbs_engine *be = &b->engine[i];
> >+			struct drm_i915_gem_request *request = be->first;
> >+
> >+			if (request == NULL) {
> >+				if ((irq_get & (1 << i))) {
> >+					if (irq_enabled & (1 << i)) {
> >+						engine->irq_put(engine);
> >+						irq_enabled &= ~ (1 << i);
> >+					}
> >+					intel_runtime_pm_put(i915);
> >+					irq_get &= ~(1 << i);
> >+				}
> >+				continue;
> >+			}
> >+
> >+			if ((irq_get & (1 << i)) == 0) {
> >+				intel_runtime_pm_get(i915);
> >+				irq_enabled |= __irq_enable(engine) << i;
> >+				irq_get |= 1 << i;
> >+			}
> 
> irq_get and irq_enabled are creating a little bit of a headache for
> me. And that would probably mean a comment is be in order to explain
> what they are for and how they work.
> 
> Like this I don't see the need for irq_get to persist across loops?

Because getting the irq is quite slow, we add a jiffie shadow: irq_get
persists across iterations of the loop, so the enable/disable toggling
is amortised to the tick rate instead of happening per waiter.
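
Roughly (a sketch of the intent, not the exact patch):

	for (i = 0; i < I915_NUM_RINGS; i++) {
		bool want_irq = b->engine[i].first != NULL;
		bool have_irq = irq_get & (1 << i);

		if (want_irq && !have_irq) {
			intel_runtime_pm_get(i915);
			irq_enabled |= __irq_enable(&i915->ring[i]) << i;
			irq_get |= 1 << i;
		} else if (!want_irq && have_irq) {
			/* Only reached on a later tick, so a waiter that
			 * comes and goes within a jiffie never toggles the
			 * expensive irq enable path at all.
			 */
			if (irq_enabled & (1 << i))
				i915->ring[i].irq_put(&i915->ring[i]);
			intel_runtime_pm_put(i915);
			irq_get &= ~(1 << i);
			irq_enabled &= ~(1 << i);
		}
	}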
 
> irq_get looks like "try to get these", irq_enabled is "these ones I
> got". Apart for the weird usage of it to balance the runtime pm gets
> and puts.
> 
> A bit confusing as I said.

I don't like the names, but haven't thought of anything better.
 
> >+bool intel_breadcrumbs_add_waiter(struct drm_i915_gem_request *request)
> >+{
> >+	struct intel_breadcrumbs *b = &request->i915->breadcrumbs;
> >+	bool first = false;
> >+
> >+	spin_lock(&b->lock);
> >+	if (request->irq_count++ == 0) {
> >+		struct intel_breadcrumbs_engine *be =
> >+			&b->engine[request->ring->id];
> >+		struct rb_node **p, *parent;
> >+
> >+		if (be->first == NULL)
> >+			wake_up_process(b->task);
> 
> With a global b->lock

or even per engine

> it makes no difference but in the logical
> sense this would usually go last, after the request has been
> inserted into the tree and waiter initialized.

Except that the code is simpler with kicking the task first.
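
Since both the wakeup and the insertion happen under the same lock,
kicking first is harmless (a sketch):

	spin_lock(&b->lock);
	if (request->waiters_count++ == 0) {
		struct intel_breadcrumbs_engine *be =
			&b->engine[request->ring->id];

		/* Safe to kick before inserting: the task cannot look at
		 * the tree until we drop the lock, so the "premature"
		 * wakeup finds the new waiter already in place.
		 */
		if (be->first == NULL)
			wake_up_process(b->task);

		/* ... rb-tree insert, sorted by seqno, updating be->first ... */
	}
	spin_unlock(&b->lock);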

Since we both thought it might make sense with a per-engine lock, let's
go with it. Maybe one day it will pay off.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

