[Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

Chris Wilson chris at chris-wilson.co.uk
Fri Nov 27 06:10:59 PST 2015


On Fri, Nov 27, 2015 at 01:53:34PM +0000, Tvrtko Ursulin wrote:
> 
> On 27/11/15 13:01, Chris Wilson wrote:
> >On Fri, Nov 27, 2015 at 12:11:32PM +0000, Tvrtko Ursulin wrote:
> >>
> >>Hi,
> >>
> >>On 31/10/15 10:34, Chris Wilson wrote:
> >>>One particularly stressful scenario consists of many independent tasks
> >>>all competing for GPU time and waiting upon the results (e.g. realtime
> >>>transcoding of many, many streams). One bottleneck in particular is that
> >>>each client waits on its own results, but every client is woken up after
> >>>every batchbuffer - hence the thunder of hooves as then every client must
> >>>do its heavyweight dance to read a coherent seqno to see if it is the
> >>>lucky one. Alternatively, we can have one worker responsible for waking
> >>>after an interrupt, checking the seqno and only waking up the clients
> >>>who are complete. The disadvantage is that in the uncontended scenario
> >>>(i.e. only one waiter) we incur an extra context switch in the wakeup
> >>>path - though that should be mitigated somewhat by the busywait we do
> >>>first before sleeping.
> >>
> >>Could you explain the design in a bit more detail in the commit message?
> >
> >One worker responsible for waking up after an interrupt and waking
> >clients who are complete?
> 
> I think more is required. At least explain the request tree, how it
> is built and used, and the flow a bit.
> 
> >>And add some more comments in the code, where key things are
> >>happening, new struct members, etc?
> >
> >I did.
> 
> Maybe in some patch I don't see? The posted patch has zero comments on new
> members added to intel_engine_cs and incomplete comments for the
> same in drm_i915_gem_request.

/* Used for irq processing of waiting requests */

> And the code throughout has really too few comments relative to
> the complexity.

Where's the complexity? If you tell me, as a reader, what is not obvious,
that helps. I have tried to make the code/variables reasonably
descriptive, and my goal in comments is to explain the why, not the what.

> There is actually only one, about schedule() which is not really the
> main point! :)

But that tells me a lot about the why (why this and not something else).

> Nothing about the tree handling, state variables like
> irq_first and similar.

But they were simple? It's just a sorted list (using the available
rbtree implementation) with an optimisation to store the oldest seqno.
There's very little "why?" I felt I needed to explain there.
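
To illustrate, the shape is roughly the sketch below. The struct and
helper names are made up for the example (only i915_seqno_passed and the
rbtree helpers are real interfaces); the point is just the sorted insert
plus the cached oldest waiter that the irq worker checks first:

struct intel_waiter {
        struct rb_node node;
        struct task_struct *task;
        u32 seqno;
};

/* Waiters are kept sorted by the seqno they are blocked on; if the new
 * waiter never descends right it becomes the new oldest and we update
 * the cached pointer the irq worker inspects first.
 */
static void insert_waiter(struct rb_root *waiters,
                          struct intel_waiter **oldest,
                          struct intel_waiter *wait)
{
        struct rb_node **p = &waiters->rb_node, *parent = NULL;
        bool first = true;

        while (*p) {
                struct intel_waiter *it =
                        rb_entry(*p, struct intel_waiter, node);

                parent = *p;
                if (i915_seqno_passed(it->seqno, wait->seqno)) {
                        p = &parent->rb_left;
                } else {
                        p = &parent->rb_right;
                        first = false;
                }
        }
        rb_link_node(&wait->node, parent, p);
        rb_insert_color(&wait->node, waiters);

        if (first)
                *oldest = wait;
}

/* Called from the irq worker with the latest hardware seqno: wake only
 * the clients whose requests have completed, in order, and advance the
 * cached oldest waiter.
 */
static void wake_completed_waiters(struct rb_root *waiters,
                                   struct intel_waiter **oldest, u32 seqno)
{
        while (*oldest && i915_seqno_passed(seqno, (*oldest)->seqno)) {
                struct rb_node *next = rb_next(&(*oldest)->node);

                wake_up_process((*oldest)->task);
                rb_erase(&(*oldest)->node, waiters);
                *oldest = next ? rb_entry(next, struct intel_waiter, node)
                               : NULL;
        }
}

With the oldest waiter cached, the interrupt side only has to compare one
seqno to decide whether anyone needs waking at all.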
 
> >>>+
> >>>+		/* Unlike the individual clients, we do not want this
> >>>+		 * background thread to contribute to the system load,
> >>>+		 * i.e. we do not want to use io_schedule() here.
> >>>+		 */
> >>>+		schedule();
> >>
> >>I am also thinking about whether the busy spin makes sense more in wait
> >>request or here. Hm.. With that optimisation, which makes only the
> >>waiter on the next request in the queue spin, it is OK to do it
> >>there.
> >
> >No. I don't think that makes sense. To get here we have demonstrated that
> >we are in a long-wait sequence, and the code already has the
> >inclination to prevent busy-spins when we are waiting. Now we do introduce
> >yet another context switch between the irq and client, but most of the
> >latency is introduced by setting up the irq in the first place.
> 
> What doesn't make sense? To keep the busy spin, even optimized, in
> wait request after this patch?

Oh, I thought you were talking about adding a spin here, i.e. after
we are woken up to complete a request, busywait on the next one before
sleeping.
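
For reference, the busywait under discussion is just a bounded poll
before committing to the irq + sleep path. A sketch, with invented names,
the seqno read abstracted behind a callback, and the ~10us budget as an
example value:

#define SPIN_TIMEOUT_US 10      /* illustrative budget for the example */

static bool busywait_for_seqno(u32 (*read_seqno)(void *), void *arg,
                               u32 seqno)
{
        ktime_t timeout = ktime_add_us(ktime_get_raw(), SPIN_TIMEOUT_US);

        do {
                if (i915_seqno_passed(read_seqno(arg), seqno))
                        return true;    /* completed without sleeping */

                if (need_resched())
                        break;          /* yield if someone else wants the cpu */

                cpu_relax();
        } while (ktime_before(ktime_get_raw(), timeout));

        return false;   /* fall back to enabling the irq and sleeping */
}

And only the waiter at the head of the queue would run this spin;
everyone else goes straight to sleep and is woken by the worker.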
 
> >>(So please respin that patch series as well.)
> >
> >In review there was only a quick rename, but I also want to change it to
> >a 10us timeout as that is favourable to more synchronous igt loads.
> 
> I thought I found a reversal of logic in "drm/i915: Only spin whilst
> waiting on the current request" which needs fixing?

That was trivial; the error was introduced when rebasing the patch from
using this kworker back to vanilla.

> >>Worth thinking about a dedicated, maybe even CPU local work queue
> >>for maximum responsiveness?
> >
> >Considered a kthread per ring. Downsides are that it consumes a slot for
> >mostly idle threads, which are what the kworkers are for. (I am not keen
> >on the complaints we would get for 6 kirq-i915/* threads.)
> 
> Also, I like the worker since when there is no one waiting it does
> not get woken up from notify_ring().

A kthread would also keep the time the irq is enabled to a minimum. I've
seen the misery of having the IRQ fire after the breadcrumb and do not
want that back.
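
For comparison, the kthread-per-ring shape would be roughly the below
(the function names and the irq_thread field are invented for the
example); the thread parks itself interruptible and relies on
notify_ring() calling wake_up_process() on it:

static int intel_engine_irq_thread(void *arg)
{
        struct intel_engine_cs *engine = arg;

        while (!kthread_should_stop()) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (!kthread_should_stop())
                        schedule();     /* sleep until an interrupt wakes us */
                __set_current_state(TASK_RUNNING);

                /* walk the waiter tree and wake completed clients here */
        }

        return 0;
}

static int intel_engine_init_irq_thread(struct intel_engine_cs *engine)
{
        struct task_struct *task;

        /* one mostly idle thread per engine: "kirq-i915/0" .. "kirq-i915/5" */
        task = kthread_run(intel_engine_irq_thread, engine,
                           "kirq-i915/%d", engine->id);
        if (IS_ERR(task))
                return PTR_ERR(task);

        engine->irq_thread = task;      /* hypothetical field for the example */
        return 0;
}

The cost is exactly those six mostly idle threads sitting in ps.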
 
> So was just thinking, wasn't sure, if using the system wq makes it
> compete with other jobs. If so dedicated worker wouldn't harm. Could
> make it high priority as well. Not sure it would make sense to make
> it bound any more. Probably the period for sleep to wakeup is too
> long for locality effects.

Right, this was why I thought a kthread would be better, as we wouldn't
monopolise a kworker thread by sleeping for long periods of time. Not
sure what the answer is, but with a more complicated single kirq-i915
design perhaps the tradeoff is towards the kthread.

> P.S. And just realised this work is competing with the scheduler
> which changes all this again.

On the other hand, there are regressions to be solved before more
features.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

