[Intel-gfx] [CI 06/20] drm/i915: Slaughter the thundering i915_wait_request herd

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri May 20 12:04:13 UTC 2016


On 19/05/16 12:32, Chris Wilson wrote:
> One particularly stressful scenario consists of many independent tasks
> all competing for GPU time and waiting upon the results (e.g. realtime
> transcoding of many, many streams). One bottleneck in particular is that
> each client waits on its own results, but every client is woken up after
> every batchbuffer - hence the thunder of hooves as then every client must
> do its heavyweight dance to read a coherent seqno to see if it is the
> lucky one.
>
> Ideally, we only want one client to wake up after the interrupt and
> check its request for completion. Since the requests must retire in
> order, we can select the first client on the oldest request to be woken.
> Once that client has completed its wait, we can then wake up the
> next client and so on. However, all clients then incur latency as every
> process in the chain may be delayed for scheduling - this may also then
> cause some priority inversion. To reduce the latency, when a client
> is added or removed from the list, we scan the tree for completed
> seqno and wake up all the completed waiters in parallel.
>
> Using igt/benchmarks/gem_latency, we can demonstrate this effect. The
> benchmark measures the number of GPU cycles between completion of a
> batch and the client waking up from a call to wait-ioctl. With many
> concurrent waiters, with each on a different request, we observe that
> the wakeup latency before the patch scales nearly linearly with the
> number of waiters (before external factors kick in making the scaling much
> worse). After applying the patch, we can see that only the single waiter
> for the request is being woken up, providing a constant wakeup latency
> for every operation. However, the situation is not quite as rosy for
> many waiters on the same request, though to the best of my knowledge this
> is much less likely in practice. Here, we can observe that the
> concurrent waiters incur extra latency from being woken up by the
> solitary bottom-half, rather than directly by the interrupt. This
> appears to be scheduler induced (having discounted adverse effects from
> having a rbtree walk/erase in the wakeup path), each additional
> wake_up_process() costs approximately 1us on a big core. Another effect of
> performing the secondary wakeups from the first bottom-half is the
> incurred delay this imposes on high priority threads - rather than
> immediately returning to userspace and leaving the interrupt handler to
> wake the others.
>
> To offset the delay incurred with additional waiters on a request, we
> could use a hybrid scheme that did a quick read in the interrupt handler
> and dequeued all the completed waiters (incurring the overhead in the
> interrupt handler, not the best plan either as we then incur GPU
> submission latency) but we would still have to wake up the bottom-half
> every time to do the heavyweight slow read. Or we could only kick the
> waiters on the seqno with the same priority as the current task (i.e. in
> the realtime waiter scenario, only it is woken up immediately by the
> interrupt and simply queues the next waiter before returning to userspace,
> minimising its delay at the expense of the chain, and also reducing
> contention on its scheduler runqueue). This is effective at avoiding long
> pauses in the interrupt handler and at avoiding the extra latency in
> realtime/high-priority waiters.
>
> v2: Convert from a kworker per engine into a dedicated kthread for the
> bottom-half.
> v3: Rename request members and tweak comments.
> v4: Use a per-engine spinlock in the breadcrumbs bottom-half.
> v5: Fix race in locklessly checking waiter status and kicking the task on
> adding a new waiter.
> v6: Fix deciding when to force the timer to hide missing interrupts.
> v7: Move the bottom-half from the kthread to the first client process.
> v8: Reword a few comments
> v9: Break the busy loop when the interrupt is unmasked or has fired.
> v10: Comments, unnecessary churn, better debugging from Tvrtko
> v11: Wake all completed waiters on removing the current bottom-half to
> reduce the latency of waking up a herd of clients all waiting on the
> same request.
> v12: Rearrange missed-interrupt fault injection so that it works with
> igt/drv_missed_irq_hang
> v13: Rename intel_breadcrumb and friends to intel_wait in preparation
> for signal handling.
> v14: RCU commentary, assert_spin_locked
> v15: Hide BUG_ON behind the compiler; report on gem_latency findings.
> v16: Sort seqno-groups by priority so that first-waiter has the highest
> task priority (and so avoid priority inversion).
> v17: Add waiters to post-mortem GPU hang state.

A lot of time has passed since I last looked at this. :I A lot of 
complexity to recall. :)

>
> Testcase: igt/gem_concurrent_blit
> Testcase: igt/benchmarks/gem_latency
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin at intel.com>
> Cc: "Gong, Zhipeng" <zhipeng.gong at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Cc: Dave Gordon <david.s.gordon at intel.com>
> Cc: "Goel, Akash" <akash.goel at intel.com>
> ---
>   drivers/gpu/drm/i915/Makefile            |   1 +
>   drivers/gpu/drm/i915/i915_debugfs.c      |  15 +-
>   drivers/gpu/drm/i915/i915_drv.h          |  39 +++-
>   drivers/gpu/drm/i915/i915_gem.c          | 130 ++++--------
>   drivers/gpu/drm/i915/i915_gpu_error.c    |  59 +++++-
>   drivers/gpu/drm/i915/i915_irq.c          |  27 +--
>   drivers/gpu/drm/i915/intel_breadcrumbs.c | 333 +++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.c         |   4 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.c  |   3 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.h  |  67 ++++++-
>   10 files changed, 567 insertions(+), 111 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/intel_breadcrumbs.c
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 7e2944406b8f..4c4e19b3f4d7 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -37,6 +37,7 @@ i915-y += i915_cmd_parser.o \
>   	  i915_gem_userptr.o \
>   	  i915_gpu_error.o \
>   	  i915_trace_points.o \
> +	  intel_breadcrumbs.o \
>   	  intel_lrc.o \
>   	  intel_mocs.o \
>   	  intel_ringbuffer.o \
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 24f4105b910f..02a923feeb7d 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -761,10 +761,21 @@ static int i915_gem_request_info(struct seq_file *m, void *data)
>   static void i915_ring_seqno_info(struct seq_file *m,
>   				 struct intel_engine_cs *engine)
>   {
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +	struct rb_node *rb;
> +
>   	seq_printf(m, "Current sequence (%s): %x\n",
>   		   engine->name, engine->get_seqno(engine));
>   	seq_printf(m, "Current user interrupts (%s): %x\n",
>   		   engine->name, READ_ONCE(engine->user_interrupts));
> +
> +	spin_lock(&b->lock);
> +	for (rb = rb_first(&b->waiters); rb != NULL; rb = rb_next(rb)) {
> +		struct intel_wait *w = container_of(rb, typeof(*w), node);
> +		seq_printf(m, "Waiting (%s): %s [%d] on %x\n",
> +			   engine->name, w->task->comm, w->task->pid, w->seqno);
> +	}
> +	spin_unlock(&b->lock);
>   }
>
>   static int i915_gem_seqno_info(struct seq_file *m, void *data)
> @@ -1400,6 +1411,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>   			   engine->hangcheck.seqno,
>   			   seqno[id],
>   			   engine->last_submitted_seqno);
> +		seq_printf(m, "\twaiters? %d\n",
> +			   intel_engine_has_waiter(engine));
>   		seq_printf(m, "\tuser interrupts = %x [current %x]\n",
>   			   engine->hangcheck.user_interrupts,
>   			   READ_ONCE(engine->user_interrupts));
> @@ -2384,7 +2397,7 @@ static int count_irq_waiters(struct drm_i915_private *i915)
>   	int count = 0;
>
>   	for_each_engine(engine, i915)
> -		count += engine->irq_refcount;
> +		count += intel_engine_has_waiter(engine);
>
>   	return count;
>   }
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 33553ea99a03..7b329464e8eb 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -503,6 +503,7 @@ struct drm_i915_error_state {
>   		bool valid;
>   		/* Software tracked state */
>   		bool waiting;
> +		int num_waiters;
>   		int hangcheck_score;
>   		enum intel_ring_hangcheck_action hangcheck_action;
>   		int num_requests;
> @@ -548,6 +549,12 @@ struct drm_i915_error_state {
>   			u32 tail;
>   		} *requests;
>
> +		struct drm_i915_error_waiter {
> +			char comm[TASK_COMM_LEN];
> +			pid_t pid;
> +			u32 seqno;
> +		} *waiters;
> +
>   		struct {
>   			u32 gfx_mode;
>   			union {
> @@ -1415,7 +1422,7 @@ struct i915_gpu_error {
>   #define I915_STOP_RING_ALLOW_WARN      (1 << 30)
>
>   	/* For missed irq/seqno simulation. */
> -	unsigned int test_irq_rings;
> +	unsigned long test_irq_rings;
>   };
>
>   enum modeset_restore {
> @@ -2941,7 +2948,6 @@ ibx_disable_display_interrupt(struct drm_i915_private *dev_priv, uint32_t bits)
>   	ibx_display_interrupt_update(dev_priv, bits, 0);
>   }
>
> -
>   /* i915_gem.c */
>   int i915_gem_create_ioctl(struct drm_device *dev, void *data,
>   			  struct drm_file *file_priv);
> @@ -3815,4 +3821,33 @@ static inline void i915_trace_irq_get(struct intel_engine_cs *engine,
>   		i915_gem_request_assign(&engine->trace_irq_req, req);
>   }
>
> +static inline bool __i915_request_irq_complete(struct drm_i915_gem_request *req)
> +{
> +	/* Ensure our read of the seqno is coherent so that we
> +	 * do not "miss an interrupt" (i.e. if this is the last
> +	 * request and the seqno write from the GPU is not visible
> +	 * by the time the interrupt fires, we will see that the
> +	 * request is incomplete and go back to sleep awaiting
> +	 * another interrupt that will never come.)
> +	 *
> +	 * Strictly, we only need to do this once after an interrupt,
> +	 * but it is easier and safer to do it every time the waiter
> +	 * is woken.
> +	 */
> +	if (i915_gem_request_completed(req, false))
> +		return true;
> +
> +	/* We need to check whether any gpu reset happened in between
> +	 * the request being submitted and now. If a reset has occurred,
> +	 * the request is effectively complete (we either are in the
> +	 * process of or have discarded the rendering and completely
> +	 * reset the GPU. The results of the request are lost and we
> +	 * are free to continue on with the original operation.
> +	 */
> +	if (req->reset_counter != i915_reset_counter(&req->i915->gpu_error))
> +		return true;
> +
> +	return false;
> +}
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 06013b9fbc6a..9348b5d7de0e 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1123,17 +1123,6 @@ i915_gem_check_wedge(unsigned reset_counter, bool interruptible)
>   	return 0;
>   }
>
> -static void fake_irq(unsigned long data)
> -{
> -	wake_up_process((struct task_struct *)data);
> -}
> -
> -static bool missed_irq(struct drm_i915_private *dev_priv,
> -		       struct intel_engine_cs *engine)
> -{
> -	return test_bit(engine->id, &dev_priv->gpu_error.missed_irq_rings);
> -}
> -
>   static unsigned long local_clock_us(unsigned *cpu)
>   {
>   	unsigned long t;
> @@ -1166,7 +1155,7 @@ static bool busywait_stop(unsigned long timeout, unsigned cpu)
>   	return this_cpu != cpu;
>   }
>
> -static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> +static bool __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   {
>   	unsigned long timeout;
>   	unsigned cpu;
> @@ -1181,17 +1170,14 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   	 * takes to sleep on a request, on the order of a microsecond.
>   	 */
>
> -	if (req->engine->irq_refcount)
> -		return -EBUSY;
> -
>   	/* Only spin if we know the GPU is processing this request */
>   	if (!i915_gem_request_started(req, true))
> -		return -EAGAIN;
> +		return false;
>
>   	timeout = local_clock_us(&cpu) + 5;
> -	while (!need_resched()) {
> +	do {
>   		if (i915_gem_request_completed(req, true))
> -			return 0;
> +			return true;
>
>   		if (signal_pending_state(state, current))
>   			break;
> @@ -1200,12 +1186,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   			break;
>
>   		cpu_relax_lowlatency();
> -	}
> -
> -	if (i915_gem_request_completed(req, false))
> -		return 0;
> +	} while (!need_resched());
>
> -	return -EAGAIN;
> +	return false;
>   }
>
>   /**
> @@ -1229,17 +1212,13 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   			s64 *timeout,
>   			struct intel_rps_client *rps)
>   {
> -	struct intel_engine_cs *engine = i915_gem_request_get_engine(req);
> -	struct drm_i915_private *dev_priv = req->i915;
> -	const bool irq_test_in_progress =
> -		ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_engine_flag(engine);
>   	int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
> -	DEFINE_WAIT(wait);
> -	unsigned long timeout_expire;
> +	struct intel_wait wait;
> +	unsigned long timeout_remain;
>   	s64 before = 0; /* Only to silence a compiler warning. */
> -	int ret;
> +	int ret = 0;
>
> -	WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
> +	might_sleep();
>
>   	if (list_empty(&req->list))
>   		return 0;
> @@ -1247,7 +1226,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   	if (i915_gem_request_completed(req, true))
>   		return 0;
>
> -	timeout_expire = 0;
> +	timeout_remain = MAX_SCHEDULE_TIMEOUT;
>   	if (timeout) {
>   		if (WARN_ON(*timeout < 0))
>   			return -EINVAL;
> @@ -1255,7 +1234,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   		if (*timeout == 0)
>   			return -ETIME;
>
> -		timeout_expire = jiffies + nsecs_to_jiffies_timeout(*timeout);
> +		timeout_remain = nsecs_to_jiffies_timeout(*timeout);
>
>   		/*
>   		 * Record current time in case interrupted by signal, or wedged.
> @@ -1263,78 +1242,57 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>   		before = ktime_get_raw_ns();
>   	}
>
> -	if (INTEL_INFO(dev_priv)->gen >= 6)
> -		gen6_rps_boost(dev_priv, rps, req->emitted_jiffies);
> -
>   	trace_i915_gem_request_wait_begin(req);
>
> -	/* Optimistic spin for the next jiffie before touching IRQs */
> -	ret = __i915_spin_request(req, state);
> -	if (ret == 0)
> -		goto out;
> +	if (INTEL_INFO(req->i915)->gen >= 6)
> +		gen6_rps_boost(req->i915, rps, req->emitted_jiffies);
>
> -	if (!irq_test_in_progress && WARN_ON(!engine->irq_get(engine))) {
> -		ret = -ENODEV;
> -		goto out;
> -	}
> -
> -	for (;;) {
> -		struct timer_list timer;
> +	/* Optimistic spin for the next ~jiffie before touching IRQs */
> +	if (__i915_spin_request(req, state))
> +		goto complete;
>
> -		prepare_to_wait(&engine->irq_queue, &wait, state);
> -
> -		/* We need to check whether any gpu reset happened in between
> -		 * the request being submitted and now. If a reset has occurred,
> -		 * the request is effectively complete (we either are in the
> -		 * process of or have discarded the rendering and completely
> -		 * reset the GPU. The results of the request are lost and we
> -		 * are free to continue on with the original operation.
> +	intel_wait_init(&wait, req->seqno);
> +	set_current_state(state);
> +	if (intel_engine_add_wait(req->engine, &wait))
> +		/* In order to check that we haven't missed the interrupt
> +		 * as we enabled it, we need to kick ourselves to do a
> +		 * coherent check on the seqno before we sleep.
>   		 */
> -		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> -			ret = 0;
> -			break;
> -		}
> -
> -		if (i915_gem_request_completed(req, false)) {
> -			ret = 0;
> -			break;
> -		}
> +		goto wakeup;
>
> +	for (;;) {
>   		if (signal_pending_state(state, current)) {
>   			ret = -ERESTARTSYS;
>   			break;
>   		}
>
> -		if (timeout && time_after_eq(jiffies, timeout_expire)) {
> +		/* Ensure that even if the GPU hangs, we get woken up. */
> +		i915_queue_hangcheck(req->i915);
> +
> +		timeout_remain = io_schedule_timeout(timeout_remain);
> +		if (timeout_remain == 0) {
>   			ret = -ETIME;
>   			break;
>   		}
>
> -		/* Ensure that even if the GPU hangs, we get woken up. */
> -		i915_queue_hangcheck(dev_priv);
> -
> -		timer.function = NULL;
> -		if (timeout || missed_irq(dev_priv, engine)) {
> -			unsigned long expire;
> -
> -			setup_timer_on_stack(&timer, fake_irq, (unsigned long)current);
> -			expire = missed_irq(dev_priv, engine) ? jiffies + 1 : timeout_expire;
> -			mod_timer(&timer, expire);
> -		}
> +		if (intel_wait_complete(&wait))
> +			break;
>
> -		io_schedule();
> +wakeup:
> +		set_current_state(state);
>
> -		if (timer.function) {
> -			del_singleshot_timer_sync(&timer);
> -			destroy_timer_on_stack(&timer);
> -		}
> +		/* Carefully check if the request is complete, giving time
> +		 * for the seqno to be visible following the interrupt.
> +		 * We also have to check in case we are kicked by the GPU
> +		 * reset in order to drop the struct_mutex.
> +		 */
> +		if (__i915_request_irq_complete(req))
> +			break;
>   	}
> -	if (!irq_test_in_progress)
> -		engine->irq_put(engine);
>
> -	finish_wait(&engine->irq_queue, &wait);
> -
> -out:
> +	intel_engine_remove_wait(req->engine, &wait);
> +	__set_current_state(TASK_RUNNING);
> +complete:
>   	trace_i915_gem_request_wait_end(req);
>
>   	if (timeout) {
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 34ff2459ceea..89241ffcc676 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -463,6 +463,18 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>   			}
>   		}
>
> +		if (error->ring[i].num_waiters) {
> +			err_printf(m, "%s --- %d waiters\n",
> +				   dev_priv->engine[i].name,
> +				   error->ring[i].num_waiters);
> +			for (j = 0; j < error->ring[i].num_waiters; j++) {
> +				err_printf(m, " seqno 0x%08x for %s [%d]\n",
> +					   error->ring[i].waiters[j].seqno,
> +					   error->ring[i].waiters[j].comm,
> +					   error->ring[i].waiters[j].pid);
> +			}
> +		}
> +
>   		if ((obj = error->ring[i].ringbuffer)) {
>   			err_printf(m, "%s --- ringbuffer = 0x%08x\n",
>   				   dev_priv->engine[i].name,
> @@ -605,8 +617,9 @@ static void i915_error_state_free(struct kref *error_ref)
>   		i915_error_object_free(error->ring[i].ringbuffer);
>   		i915_error_object_free(error->ring[i].hws_page);
>   		i915_error_object_free(error->ring[i].ctx);
> -		kfree(error->ring[i].requests);
>   		i915_error_object_free(error->ring[i].wa_ctx);
> +		kfree(error->ring[i].requests);
> +		kfree(error->ring[i].waiters);
>   	}
>
>   	i915_error_object_free(error->semaphore_obj);
> @@ -892,6 +905,47 @@ static void gen6_record_semaphore_state(struct drm_i915_private *dev_priv,
>   	}
>   }
>
> +static void engine_record_waiters(struct intel_engine_cs *engine,
> +				  struct drm_i915_error_ring *ering)
> +{
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +	struct drm_i915_error_waiter *waiter;
> +	struct rb_node *rb;
> +	int count;
> +
> +	ering->num_waiters = 0;
> +	ering->waiters = NULL;
> +
> +	spin_lock(&b->lock);
> +	count = 0;
> +	for (rb = rb_first(&b->waiters); rb != NULL; rb = rb_next(rb))
> +		count++;
> +	spin_unlock(&b->lock);
> +
> +	waiter = NULL;
> +	if (count)
> +		waiter = kmalloc(count*sizeof(struct drm_i915_error_waiter),
> +				 GFP_ATOMIC);
> +	if (!waiter)
> +		return;
> +
> +	ering->waiters = waiter;
> +
> +	spin_lock(&b->lock);
> +	for (rb = rb_first(&b->waiters); rb != NULL; rb = rb_next(rb)) {
> +		struct intel_wait *w = container_of(rb, typeof(*w), node);
> +
> +		strcpy(waiter->comm, w->task->comm);
> +		waiter->pid = w->task->pid;
> +		waiter->seqno = w->seqno;
> +		waiter++;
> +
> +		if (++ering->num_waiters == count)
> +			break;
> +	}
> +	spin_unlock(&b->lock);
> +}
> +
>   static void i915_record_ring_state(struct drm_i915_private *dev_priv,
>   				   struct drm_i915_error_state *error,
>   				   struct intel_engine_cs *engine,
> @@ -926,7 +980,7 @@ static void i915_record_ring_state(struct drm_i915_private *dev_priv,
>   		ering->instdone = I915_READ(GEN2_INSTDONE);
>   	}
>
> -	ering->waiting = waitqueue_active(&engine->irq_queue);
> +	ering->waiting = intel_engine_has_waiter(engine);
>   	ering->instpm = I915_READ(RING_INSTPM(engine->mmio_base));
>   	ering->acthd = intel_ring_get_active_head(engine);
>   	ering->seqno = engine->get_seqno(engine);
> @@ -1032,6 +1086,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
>   		error->ring[i].valid = true;
>
>   		i915_record_ring_state(dev_priv, error, engine, &error->ring[i]);
> +		engine_record_waiters(engine, &error->ring[i]);
>
>   		request = i915_gem_find_active_request(engine);
>   		if (request) {
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 27dd45a7291c..c49b8356a80d 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -988,13 +988,10 @@ static void ironlake_rps_change_irq_handler(struct drm_i915_private *dev_priv)
>
>   static void notify_ring(struct intel_engine_cs *engine)
>   {
> -	if (!intel_engine_initialized(engine))
> -		return;
> -
> -	trace_i915_gem_request_notify(engine);
> -	engine->user_interrupts++;
> -
> -	wake_up_all(&engine->irq_queue);
> +	if (intel_engine_wakeup(engine)) {
> +		trace_i915_gem_request_notify(engine);
> +		engine->user_interrupts++;
> +	}
>   }
>
>   static void vlv_c0_read(struct drm_i915_private *dev_priv,
> @@ -1075,7 +1072,7 @@ static bool any_waiters(struct drm_i915_private *dev_priv)
>   	struct intel_engine_cs *engine;
>
>   	for_each_engine(engine, dev_priv)
> -		if (engine->irq_refcount)
> +		if (intel_engine_has_waiter(engine))
>   			return true;
>
>   	return false;
> @@ -2505,8 +2502,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>   static void i915_error_wake_up(struct drm_i915_private *dev_priv,
>   			       bool reset_completed)
>   {
> -	struct intel_engine_cs *engine;
> -
>   	/*
>   	 * Notify all waiters for GPU completion events that reset state has
>   	 * been changed, and that they need to restart their wait after
> @@ -2514,9 +2509,8 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
>   	 * a gpu reset pending so that i915_error_work_func can acquire them).
>   	 */
>
> -	/* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
> -	for_each_engine(engine, dev_priv)
> -		wake_up_all(&engine->irq_queue);
> +	/* Wake up i915_wait_request, potentially holding dev->struct_mutex. */
> +	intel_kick_waiters(dev_priv);
>
>   	/* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
>   	wake_up_all(&dev_priv->pending_flip_queue);
> @@ -3098,13 +3092,14 @@ static unsigned kick_waiters(struct intel_engine_cs *engine)
>
>   	if (engine->hangcheck.user_interrupts == user_interrupts &&
>   	    !test_and_set_bit(engine->id, &i915->gpu_error.missed_irq_rings)) {
> -		if (!(i915->gpu_error.test_irq_rings & intel_engine_flag(engine)))
> +		if (!test_bit(engine->id, &i915->gpu_error.test_irq_rings))
>   			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
>   				  engine->name);
>   		else
>   			DRM_INFO("Fake missed irq on %s\n",
>   				 engine->name);
> -		wake_up_all(&engine->irq_queue);
> +
> +		intel_engine_enable_fake_irq(engine);
>   	}
>
>   	return user_interrupts;
> @@ -3148,7 +3143,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
>
>   	for_each_engine_id(engine, dev_priv, id) {
> -		bool busy = waitqueue_active(&engine->irq_queue);
> +		bool busy = intel_engine_has_waiter(engine);
>   		u64 acthd;
>   		u32 seqno;
>   		unsigned user_interrupts;
> diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> new file mode 100644
> index 000000000000..c82ecbf7470a
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> @@ -0,0 +1,333 @@
> +/*
> + * Copyright © 2015 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + *
> + */
> +
> +#include "i915_drv.h"
> +
> +static void intel_breadcrumbs_fake_irq(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +
> +	/*
> +	 * The timer persists in case we cannot enable interrupts,
> +	 * or if we have previously seen seqno/interrupt incoherency
> +	 * ("missed interrupt" syndrome). Here the worker will wake up
> +	 * every jiffie in order to kick the oldest waiter to do the
> +	 * coherent seqno check.
> +	 */
> +	rcu_read_lock();
> +	if (intel_engine_wakeup(engine))
> +		mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1);
> +	rcu_read_unlock();
> +}
> +
> +static void irq_enable(struct intel_engine_cs *engine)
> +{
> +	WARN_ON(!engine->irq_get(engine));
> +}
> +
> +static void irq_disable(struct intel_engine_cs *engine)
> +{
> +	engine->irq_put(engine);
> +}
> +
> +static bool __intel_breadcrumbs_enable_irq(struct intel_breadcrumbs *b)
> +{
> +	struct intel_engine_cs *engine =
> +		container_of(b, struct intel_engine_cs, breadcrumbs);
> +	struct drm_i915_private *i915 = engine->i915;
> +	bool irq_posted = false;
> +
> +	assert_spin_locked(&b->lock);
> +	if (b->rpm_wakelock)
> +		return false;
> +
> +	/* Since we are waiting on a request, the GPU should be busy
> +	 * and should have its own rpm reference. For completeness,
> +	 * record an rpm reference for ourselves to cover the
> +	 * interrupt we unmask.
> +	 */
> +	intel_runtime_pm_get_noresume(i915);
> +	b->rpm_wakelock = true;
> +
> +	/* No interrupts? Kick the waiter every jiffie! */
> +	if (intel_irqs_enabled(i915)) {
> +		if (!test_bit(engine->id, &i915->gpu_error.test_irq_rings)) {
> +			irq_enable(engine);
> +			irq_posted = true;
> +		}
> +		b->irq_enabled = true;
> +	}
> +
> +	if (!b->irq_enabled ||
> +	    test_bit(engine->id, &i915->gpu_error.missed_irq_rings))
> +		mod_timer(&b->fake_irq, jiffies + 1);
> +
> +	return irq_posted;
> +}
> +
> +static void __intel_breadcrumbs_disable_irq(struct intel_breadcrumbs *b)
> +{
> +	struct intel_engine_cs *engine =
> +		container_of(b, struct intel_engine_cs, breadcrumbs);
> +
> +	assert_spin_locked(&b->lock);
> +	if (!b->rpm_wakelock)
> +		return;
> +
> +	if (b->irq_enabled) {
> +		irq_disable(engine);
> +		b->irq_enabled = false;
> +	}
> +
> +	intel_runtime_pm_put(engine->i915);
> +	b->rpm_wakelock = false;
> +}
> +
> +static inline struct intel_wait *to_wait(struct rb_node *node)
> +{
> +	return container_of(node, struct intel_wait, node);
> +}
> +
> +static inline void __intel_breadcrumbs_finish(struct intel_breadcrumbs *b,
> +					      struct intel_wait *wait)
> +{
> +	assert_spin_locked(&b->lock);
> +
> +	/* This request is completed, so remove it from the tree, mark it as
> +	 * complete, and *then* wake up the associated task.
> +	 */
> +	rb_erase(&wait->node, &b->waiters);
> +	RB_CLEAR_NODE(&wait->node);
> +
> +	wake_up_process(wait->task); /* implicit smp_wmb() */
> +}
> +
> +bool intel_engine_add_wait(struct intel_engine_cs *engine,
> +			   struct intel_wait *wait)
> +{
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +	struct rb_node **p, *parent, *completed;
> +	bool first;
> +	u32 seqno;
> +
> +	spin_lock(&b->lock);
> +
> +	/* Insert the request into the retirement ordered list
> +	 * of waiters by walking the rbtree. If we are the oldest
> +	 * seqno in the tree (the first to be retired), then
> +	 * set ourselves as the bottom-half.
> +	 *
> +	 * As we descend the tree, prune completed branches since we hold the
> +	 * spinlock we know that the first_waiter must be delayed and can
> +	 * reduce some of the sequential wake up latency if we take action
> +	 * ourselves and wake up the completed tasks in parallel. Also, by
> +	 * removing stale elements in the tree, we may be able to reduce the
> +	 * ping-pong between the old bottom-half and ourselves as first-waiter.
> +	 */
> +	first = true;
> +	parent = NULL;
> +	completed = NULL;
> +	seqno = engine->get_seqno(engine);
> +
> +	p = &b->waiters.rb_node;
> +	while (*p) {
> +		parent = *p;
> +		if (wait->seqno == to_wait(parent)->seqno) {
> +			/* We have multiple waiters on the same seqno, select
> +			 * the highest priority task (that with the smallest
> +			 * task->prio) to serve as the bottom-half for this
> +			 * group.
> +			 */
> +			if (wait->task->prio > to_wait(parent)->task->prio) {
> +				p = &parent->rb_right;
> +				first = false;
> +			} else
> +				p = &parent->rb_left;
> +		} else if (i915_seqno_passed(wait->seqno,
> +					     to_wait(parent)->seqno)) {
> +			p = &parent->rb_right;
> +			if (i915_seqno_passed(seqno, to_wait(parent)->seqno))
> +				completed = parent;

Hm, don't you need the completed handling in the equal-seqno case as well?

> +			else
> +				first = false;
> +		} else
> +			p = &parent->rb_left;
> +	}
> +	rb_link_node(&wait->node, parent, p);
> +	rb_insert_color(&wait->node, &b->waiters);
> +
> +	if (completed != NULL) {
> +		struct rb_node *next = rb_next(completed);
> +
> +		if (next && next != &wait->node) {
> +			GEM_BUG_ON(first);
> +			smp_store_mb(b->first_waiter, to_wait(next)->task);
> +			/* As there is a delay between reading the current
> +			 * seqno, processing the completed tasks and selecting
> +			 * the next waiter, we may have missed the interrupt
> +			 * and so need for the next bottom-half to wakeup.
> +			 *
> +			 * Also as we enable the IRQ, we may miss the
> +			 * interrupt for that seqno, so we have to wake up
> +			 * the next bottom-half in order to do a coherent check
> +			 * in case the seqno passed.
> +			 */
> +			__intel_breadcrumbs_enable_irq(b);
> +			wake_up_process(to_wait(next)->task);
> +		}
> +
> +		do {
> +			struct intel_wait *crumb = to_wait(completed);
> +			completed = rb_prev(completed);
> +			__intel_breadcrumbs_finish(b, crumb);
> +		} while (completed != NULL);
> +	}
> +
> +	if (first) {
> +		smp_store_mb(b->first_waiter, wait->task);
> +		first = __intel_breadcrumbs_enable_irq(b);
> +	}
> +	GEM_BUG_ON(b->first_waiter == NULL);
> +
> +	spin_unlock(&b->lock);
> +
> +	return first;
> +}
> +
> +void intel_engine_enable_fake_irq(struct intel_engine_cs *engine)
> +{
> +	mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1);
> +}
> +
> +static inline bool chain_wakeup(struct rb_node *rb, int priority)
> +{
> +	return rb && to_wait(rb)->task->prio <= priority;
> +}
> +
> +void intel_engine_remove_wait(struct intel_engine_cs *engine,
> +			      struct intel_wait *wait)
> +{
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +
> +	/* Quick check to see if this waiter was already decoupled from
> +	 * the tree by the bottom-half to avoid contention on the spinlock
> +	 * by the herd.
> +	 */
> +	if (RB_EMPTY_NODE(&wait->node))
> +		return;
> +
> +	spin_lock(&b->lock);
> +
> +	if (b->first_waiter == wait->task) {
> +		struct rb_node *next;
> +		struct task_struct *task;
> +		const int priority = wait->task->prio;
> +
> +		/* We are the current bottom-half. Find the next candidate,
> +		 * the first waiter in the queue on the remaining oldest
> +		 * request. As multiple seqnos may complete in the time it
> +		 * takes us to wake up and find the next waiter, we have to
> +		 * wake up that waiter for it to perform its own coherent
> +		 * completion check.
> +		 */
> +		next = rb_next(&wait->node);
> +		if (chain_wakeup(next, priority)) {

I don't get this: the next waiter may be for a different seqno, so how is the 
priority check relevant?

Also, how can the next node have a smaller priority anyway? For an equal 
seqno I thought it had to be equal or greater?

Then, since add_waiter will wake up one extra waiter to handle the missed-irq 
case, here you may skip checking them based simply on task priority, which 
seems wrong.

> +			/* If the next waiter is already complete,
> +			 * wake it up and continue on to the next waiter. So
> +			 * if we have a small herd, they will wake up in parallel
> +			 * rather than sequentially, which should reduce
> +			 * the overall latency in waking all the completed
> +			 * clients.
> +			 *
> +			 * However, waking up a chain adds extra latency to
> +			 * the first_waiter. This is undesirable if that
> +			 * waiter is a high priority task.
> +			 */
> +			u32 seqno = engine->get_seqno(engine);
> +			while (i915_seqno_passed(seqno, to_wait(next)->seqno)) {
> +				struct rb_node *n = rb_next(next);
> +				__intel_breadcrumbs_finish(b, to_wait(next));
> +				next = n;
> +				if (!chain_wakeup(next, priority))
> +					break;
> +			}
> +		}
> +		task = next ? to_wait(next)->task : NULL;
> +
> +		smp_store_mb(b->first_waiter, task);
> +		if (task) {
> +			/* In our haste, we may have completed the first waiter
> +			 * before we enabled the interrupt. Do so now as we
> +			 * have a second waiter for a future seqno. Afterwards,
> +			 * we have to wake up that waiter in case we missed
> +			 * the interrupt, or if we have to handle an
> +			 * exception rather than a seqno completion.
> +			 */
> +			if (to_wait(next)->seqno != wait->seqno)
> +				__intel_breadcrumbs_enable_irq(b);
> +			wake_up_process(task);
> +		} else
> +			__intel_breadcrumbs_disable_irq(b);
> +	}
> +
> +	if (!RB_EMPTY_NODE(&wait->node))
> +		rb_erase(&wait->node, &b->waiters);
> +	spin_unlock(&b->lock);
> +}
> +
> +void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine)
> +{
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +
> +	spin_lock_init(&b->lock);
> +	setup_timer(&b->fake_irq,
> +		    intel_breadcrumbs_fake_irq,
> +		    (unsigned long)engine);
> +}
> +
> +void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine)
> +{
> +	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> +
> +	del_timer_sync(&b->fake_irq);
> +}
> +
> +unsigned intel_kick_waiters(struct drm_i915_private *i915)
> +{
> +	struct intel_engine_cs *engine;
> +	unsigned mask = 0;
> +
> +	/* To avoid the task_struct disappearing beneath us as we wake up
> +	 * the process, we must first inspect the task_struct->state under the
> +	 * RCU lock, i.e. as we call wake_up_process() we must be holding the
> +	 * rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	for_each_engine(engine, i915)
> +		if (unlikely(intel_engine_wakeup(engine)))
> +			mask |= intel_engine_flag(engine);
> +	rcu_read_unlock();
> +
> +	return mask;
> +}
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index db10c961e0f4..34c6c3fd6296 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1894,6 +1894,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>   	i915_cmd_parser_fini_ring(engine);
>   	i915_gem_batch_pool_fini(&engine->batch_pool);
>
> +	intel_engine_fini_breadcrumbs(engine);
> +
>   	if (engine->status_page.obj) {
>   		i915_gem_object_unpin_map(engine->status_page.obj);
>   		engine->status_page.obj = NULL;
> @@ -1931,7 +1933,7 @@ logical_ring_default_irqs(struct intel_engine_cs *engine, unsigned shift)
>   {
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> -	init_waitqueue_head(&engine->irq_queue);
> +	intel_engine_init_breadcrumbs(engine);
>   }
>
>   static int
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 8d35a3978f9b..9bd3dd756d3d 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -2258,7 +2258,7 @@ static int intel_init_ring_buffer(struct drm_device *dev,
>   	memset(engine->semaphore.sync_seqno, 0,
>   	       sizeof(engine->semaphore.sync_seqno));
>
> -	init_waitqueue_head(&engine->irq_queue);
> +	intel_engine_init_breadcrumbs(engine);
>
>   	ringbuf = intel_engine_create_ringbuffer(engine, 32 * PAGE_SIZE);
>   	if (IS_ERR(ringbuf)) {
> @@ -2327,6 +2327,7 @@ void intel_cleanup_engine(struct intel_engine_cs *engine)
>
>   	i915_cmd_parser_fini_ring(engine);
>   	i915_gem_batch_pool_fini(&engine->batch_pool);
> +	intel_engine_fini_breadcrumbs(engine);
>   	engine->i915 = NULL;
>   }
>
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 929e7b4af2a4..28597ea8fbf9 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -160,6 +160,31 @@ struct intel_engine_cs {
>   	struct intel_ringbuffer *buffer;
>   	struct list_head buffers;
>
> +	/* Rather than have every client wait upon all user interrupts,
> +	 * with the herd waking after every interrupt and each doing the
> +	 * heavyweight seqno dance, we delegate the task (of being the
> +	 * bottom-half of the user interrupt) to the first client. After
> +	 * every interrupt, we wake up one client, who does the heavyweight
> +	 * coherent seqno read and either goes back to sleep (if incomplete),
> +	 * or wakes up all the completed clients in parallel, before then
> +	 * transferring the bottom-half status to the next client in the queue.
> +	 *
> +	 * Compared to walking the entire list of waiters in a single dedicated
> +	 * bottom-half, we reduce the latency of the first waiter by avoiding
> +	 * a context switch, but incur additional coherent seqno reads when
> +	 * following the chain of request breadcrumbs. Since it is most likely
> +	 * that we have a single client waiting on each seqno, then reducing
> +	 * the overhead of waking that client is much preferred.
> +	 */
> +	struct intel_breadcrumbs {
> +		spinlock_t lock; /* protects the lists of requests */
> +		struct rb_root waiters; /* sorted by retirement, priority */
> +		struct task_struct *first_waiter; /* bh for user interrupts */
> +		struct timer_list fake_irq; /* used after a missed interrupt */
> +		bool irq_enabled;
> +		bool rpm_wakelock;
> +	} breadcrumbs;
> +
>   	/*
>   	 * A pool of objects to use as shadow copies of client batch buffers
>   	 * when the command parser is enabled. Prevents the client from
> @@ -308,8 +333,6 @@ struct intel_engine_cs {
>
>   	bool gpu_caches_dirty;
>
> -	wait_queue_head_t irq_queue;
> -
>   	struct intel_context *last_context;
>
>   	struct intel_ring_hangcheck hangcheck;
> @@ -495,4 +518,44 @@ static inline u32 intel_hws_seqno_address(struct intel_engine_cs *engine)
>   	return engine->status_page.gfx_addr + I915_GEM_HWS_INDEX_ADDR;
>   }
>
> +/* intel_breadcrumbs.c -- user interrupt bottom-half for waiters */
> +struct intel_wait {
> +	struct rb_node node;
> +	struct task_struct *task;
> +	u32 seqno;
> +};
> +void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine);
> +static inline void intel_wait_init(struct intel_wait *wait, u32 seqno)
> +{
> +	wait->task = current;
> +	wait->seqno = seqno;
> +}
> +static inline bool intel_wait_complete(const struct intel_wait *wait)
> +{
> +	return RB_EMPTY_NODE(&wait->node);
> +}
> +bool intel_engine_add_wait(struct intel_engine_cs *engine,
> +			   struct intel_wait *wait);
> +void intel_engine_remove_wait(struct intel_engine_cs *engine,
> +			      struct intel_wait *wait);
> +static inline bool intel_engine_has_waiter(struct intel_engine_cs *engine)
> +{
> +	return READ_ONCE(engine->breadcrumbs.first_waiter);
> +}
> +static inline bool intel_engine_wakeup(struct intel_engine_cs *engine)
> +{
> +	bool wakeup = false;
> +	struct task_struct *task = READ_ONCE(engine->breadcrumbs.first_waiter);
> +	/* Note that for this not to dangerously chase a dangling pointer,
> +	 * the caller is responsible for ensuring that the task remains valid for
> +	 * wake_up_process(), i.e. that the RCU grace period cannot expire.
> +	 */

This gets called from hard irq context and I did not manage to figure 
out what makes it safe against the remove-waiter path. What stops the 
hard irq from sampling a stale, non-NULL task pointer here?

> +	if (task)
> +		wakeup = wake_up_process(task);
> +	return wakeup;
> +}
> +void intel_engine_enable_fake_irq(struct intel_engine_cs *engine);
> +void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine);
> +unsigned intel_kick_waiters(struct drm_i915_private *i915);
> +
>   #endif /* _INTEL_RINGBUFFER_H_ */
>

Otherwise still looks like a good idea! :)

Regards,

Tvrtko


More information about the Intel-gfx mailing list