[Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

Chris Wilson chris at chris-wilson.co.uk
Fri Nov 27 05:01:12 PST 2015


On Fri, Nov 27, 2015 at 12:11:32PM +0000, Tvrtko Ursulin wrote:
> 
> Hi,
> 
> On 31/10/15 10:34, Chris Wilson wrote:
> >One particularly stressful scenario consists of many independent tasks
> >all competing for GPU time and waiting upon the results (e.g. realtime
> >transcoding of many, many streams). One bottleneck in particular is that
> >each client waits on its own results, but every client is woken up after
> >every batchbuffer - hence the thunder of hooves as then every client must
> >do its heavyweight dance to read a coherent seqno to see if it is the
> >lucky one. Alternatively, we can have one worker responsible for waking
> >after an interrupt, checking the seqno and only waking up the clients
> >who are complete. The disadvantage is that in the uncontended scenario
> >(i.e. only one waiter) we incur an extra context switch in the wakeup
> >path - though that should be mitigated somewhat by the busywait we do
> >first before sleeping.
> 
> Could you explain the design in a bit more detail in the commit message?

One worker responsible for waking up after an interrupt and waking
clients who are complete?
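
A rough sketch of the shape of that worker (illustrative only -- the rb-tree
field on the request, the waiter task pointer and the seqno accessor are my
shorthand here, not necessarily the names in the patch):

static void intel_engine_irq_worker(struct work_struct *work)
{
	struct intel_engine_cs *engine =
		container_of(work, struct intel_engine_cs, irq_work);
	u32 seqno = engine->get_seqno(engine, false); /* one coherent read */
	struct rb_node *node, *next;

	spin_lock(&engine->irq_lock);
	for (node = rb_first(&engine->irq_requests); node; node = next) {
		struct drm_i915_gem_request *rq =
			rb_entry(node, struct drm_i915_gem_request, irq_node);

		next = rb_next(node);
		if (!i915_seqno_passed(seqno, rq->seqno))
			break; /* tree ordered by seqno, the rest still pend */

		rb_erase(node, &engine->irq_requests);
		wake_up_process(rq->waiter); /* only completed clients wake */
	}
	spin_unlock(&engine->irq_lock);
}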
 
> And add some more comments in the code, where key things are
> happening, new struct members, etc?

I did.

> >+		timer.function = NULL;
> >+		if (fake_irq || missed_irq(engine)) {
> >+			setup_timer_on_stack(&timer,
> >+					     (void (*)(unsigned long))fake_irq,
> 
> Kaboom when it fires? :)

It was meant to be s/fake_irq/wake_up_process/ at some point; I obviously
forgot to merge that back in (which was the whole point of all that casting).
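
For completeness, what that hunk should have read with the substitution
merged back in (same one-jiffy safety timer, just pointing at
wake_up_process so the cast actually makes sense):

		timer.function = NULL;
		if (fake_irq || missed_irq(engine)) {
			/* Kick the sleeping waiter after a jiffy in case
			 * the interrupt never arrives.
			 */
			setup_timer_on_stack(&timer,
					     (void (*)(unsigned long))wake_up_process,
					     (unsigned long)current);
			mod_timer(&timer, jiffies + 1);
		}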
 
> >+					     (unsigned long)current);
> >+			mod_timer(&timer, jiffies + 1);
> >+		}
> 
> I still see no benefit in complicating things with a timer. It is
> just a needlessly contrived way of doing a schedule_timeout. And I
> don't see we would need to extend the mechanism for more precision
> since it only comes into play in such borderline conditions that it
> doesn't matter.

I like the historical reasoning, having suffered the pain long ago. But
there is less reason not to use schedule_timeout(), unlike the only
recently introduced io_schedule_timeout(). Fine. Have it your way!
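
Something along these lines, then (sketch only, and whether the sleep is
interruptible depends on the surrounding wait loop):

		set_current_state(TASK_UNINTERRUPTIBLE);
		if (fake_irq || missed_irq(engine))
			/* No interrupt expected; re-check after a jiffy */
			schedule_timeout(1);
		else
			/* Sleep until the irq worker wakes us */
			schedule();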

> >+
> >+		/* Unlike the individual clients, we do not want this
> >+		 * background thread to contribute to the system load,
> >+		 * i.e. we do not want to use io_schedule() here.
> >+		 */
> >+		schedule();
> 
> I am also thinking about whether busy spin makes sense more in wait
> request or here. Hm.. With that optimisation which makes only the
> waiter on the next request in the queue spin it is OK to do it
> there.

No. I don't think that makes sense. To get here we have demonstrated that
we are in a long-wait sequence, and the code already has the inclination
to prevent busy-spins when we are waiting. We do introduce yet another
context switch between the irq and the client, but most of the latency is
introduced by setting up the irq in the first place.
 
> (So please respin that patch series as well.)

In review there was only a quick rename, but I also want to change it to
a 10us timeout as that is favourable to more synchronous igt loads.
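
The sort of bounded spin in question, with the 10us cut-off (rough sketch;
the helper name is mine and the seqno accessor is illustrative):

static bool busywait_for_seqno(struct intel_engine_cs *engine, u32 seqno)
{
	u64 deadline = local_clock() + 10 * NSEC_PER_USEC;

	do {
		if (i915_seqno_passed(engine->get_seqno(engine, false), seqno))
			return true;	/* completed within the spin window */
		if (need_resched())
			break;		/* yield rather than fight the scheduler */
		cpu_relax();
	} while (local_clock() < deadline);

	return false;	/* fall back to the sleeping wait */
}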
 
> But if that brings an improvement would a short spin be beneficial
> here as well? Less so because waiters are already sleeping but could
> a bit I suppose.

Indeed, much less of an improvement due to the context switches.

> >+void intel_engine_add_wakeup(struct drm_i915_gem_request *request)
> >+{
> >+	struct intel_engine_cs *engine = i915_gem_request_get_ring(request);
> >+
> >+	spin_lock(&engine->irq_lock);
> >+	if (request->irq_count++ == 0) {
> >+		struct rb_node **p, *parent;
> >+		bool first;
> >+
> >+		if (RB_EMPTY_ROOT(&engine->irq_requests))
> >+			schedule_work(&engine->irq_work);
> 
> Worth thinking about a dedicated, maybe even CPU local work queue
> for maximum responsiveness?

I considered a kthread per ring. The downside is that it consumes a slot
for threads that are mostly idle, which is what kworkers are for. (I am
not keen on the complaints we would get for 6 kirq-i915/* threads.)

What I have considered is doing intel_engine_add_wakeup() before the
spin_request. If we have a spare CPU, that CPU will be enabling the IRQ
whilst we do the busy spin. The complication is that doing the wakeup,
schedule and rb insert isn't particularly cheap.
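
Roughly the ordering I mean, reusing the busywait sketch from above (and
assuming a matching intel_engine_remove_wakeup() exists to undo the rb
insert; error handling elided):

	/* Arm the waker first: a spare CPU can be enabling the user
	 * interrupt while this client busy-spins on the seqno.
	 */
	intel_engine_add_wakeup(req);

	if (busywait_for_seqno(engine, req->seqno)) {
		/* Completed during the spin; drop our waiter slot again */
		intel_engine_remove_wakeup(req);
		return 0;
	}

	/* Otherwise fall through to the sleeping wait */
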
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

