[Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Nov 27 05:53:34 PST 2015


On 27/11/15 13:01, Chris Wilson wrote:
> On Fri, Nov 27, 2015 at 12:11:32PM +0000, Tvrtko Ursulin wrote:
>>
>> Hi,
>>
>> On 31/10/15 10:34, Chris Wilson wrote:
>>> One particularly stressful scenario consists of many independent tasks
>>> all competing for GPU time and waiting upon the results (e.g. realtime
>>> transcoding of many, many streams). One bottleneck in particular is that
>>> each client waits on its own results, but every client is woken up after
>>> every batchbuffer - hence the thunder of hooves as then every client must
>>> do its heavyweight dance to read a coherent seqno to see if it is the
>>> lucky one. Alternatively, we can have one worker responsible for waking
>>> after an interrupt, checking the seqno and only waking up the clients
>>> who are complete. The disadvantage is that in the uncontended scenario
>>> (i.e. only one waiter) we incur an extra context switch in the wakeup
>>> path - though that should be mitigated somewhat by the busywait we do
>>> first before sleeping.
>>
>> Could you explain the design in a bit more detail in the commit message?
>
> One worker responsible for waking up after an interrupt and waking
> clients who are complete?

I think more is required. At least explain the request tree, how it is 
built and used, and the flow a bit.
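
For reference, my rough understanding of the flow after staring at the 
patch is sketched below (the worker name, plus the irq_node and waiter 
fields, are my guesses; the rest is taken from the patch). It would be 
good to have this level of description in the commit message and/or 
comments:

static void intel_engine_irq_work(struct work_struct *work)
{
	struct intel_engine_cs *engine =
		container_of(work, struct intel_engine_cs, irq_work);
	struct rb_node *node;

	/* Wake only the clients whose requests have completed, oldest
	 * (lowest seqno, leftmost in the tree) first; everyone else
	 * keeps sleeping until a later interrupt.
	 */
	spin_lock(&engine->irq_lock);
	for (node = rb_first(&engine->irq_requests);
	     node;
	     node = rb_next(node)) {
		struct drm_i915_gem_request *req =
			rb_entry(node, struct drm_i915_gem_request, irq_node);

		if (!i915_gem_request_completed(req, false))
			break;

		wake_up_process(req->waiter);
	}
	spin_unlock(&engine->irq_lock);
}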

>> And add some more comments in the code, where key things are
>> happening, new struct members, etc?
>
> I did.

Maybe in some patch I don't see? The posted one has zero comments on 
the new members added to intel_engine_cs and incomplete comments for 
the same in drm_i915_gem_request.

And the code throughout really has too few comments relative to the 
complexity.

There is actually only one, about schedule(), which is not really the 
main point! :) Nothing about the tree handling, or the state variables 
like irq_first and similar.
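
For example, something along these lines for the new intel_engine_cs 
members would go a long way (my wording, and my guess at what irq_first 
is for, so adjust as appropriate):

	/* Protects the waiter tree and the irq bookkeeping below. */
	spinlock_t irq_lock;

	/* Requests with sleeping waiters, sorted by seqno so that the
	 * wakeup worker can wake the oldest clients first as their
	 * seqnos complete.
	 */
	struct rb_root irq_requests;

	/* Worker scheduled when the first waiter is added; it enables
	 * the user interrupt and wakes only the clients whose requests
	 * have completed.
	 */
	struct work_struct irq_work;

	/* Cached oldest (leftmost) waiter, so the interrupt path does
	 * not have to walk the tree to decide whether anyone needs
	 * waking (my guess at the role of irq_first).
	 */
	struct drm_i915_gem_request *irq_first;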

>>> +		timer.function = NULL;
>>> +		if (fake_irq || missed_irq(engine)) {
>>> +			setup_timer_on_stack(&timer,
>>> +					     (void (*)(unsigned long))fake_irq,
>>
>> Kaboom when it fires? :)
>
> It was s/fake_irq/wake_up_process/ at some point obviously forgot to
> merge that back in (the whole point of all that casting).
>
>>> +					     (unsigned long)current);
>>> +			mod_timer(&timer, jiffies + 1);
>>> +		}
>>
>> I still see no benefit in complicating things with a timer. It is
>> just a needlessly contrived way of doing a schedule_timeout. And I
>> don't see we would need to extend the mechanism for more precision
>> since it only comes into play in such borderline conditions that it
>> doesn't matter.
>
> I like the historical reasoning, having suffered the pain long ago. But
> there is less reason not to use schedule_timeout() unlike the only
> recently introduced io_schedule_timeout(). Fine. Have it your way!

Yay! :)
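
I.e. roughly what I had in mind, as an untested sketch (task state 
handling adapted to whatever the surrounding loop already does):

	set_current_state(TASK_UNINTERRUPTIBLE);
	if (fake_irq || missed_irq(engine)) {
		/* Interrupts are unreliable here, so poll roughly once
		 * per jiffy instead of arming an on-stack timer whose
		 * only job is to wake this task up again.
		 */
		schedule_timeout(1);
	} else {
		schedule();
	}
	__set_current_state(TASK_RUNNING);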

>>> +
>>> +		/* Unlike the individual clients, we do not want this
>>> +		 * background thread to contribute to the system load,
>>> +		 * i.e. we do not want to use io_schedule() here.
>>> +		 */
>>> +		schedule();
>>
>> I am also thinking about whether busy spin makes sense more in wait
>> request or here. Hm.. With that optimisation which makes only the
>> waiter on the next request in the queue spin it is OK to do it
>> there.
>
> No. I don't think that makes sense. To get here we have demonstrated that
> we are in a long-wait sequence, and the code already has the
> inclination to prevent busy-spins when we are waiting. Now we do introduce
> yet another context switch between the irq and client, but most of the
> latency is introduced by setting up the irq in the first place.

What doesn't make sense? To keep the busy spin, even optimized, in wait 
request after this patch?
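
To spell out what I mean, keep a short, bounded pre-sleep spin in 
__i915_wait_request, but only for the request currently executing. A 
rough, self-contained sketch (using the 10us value from your reply 
rather than the exact helpers from your series):

static bool __i915_spin_request(struct drm_i915_gem_request *req,
				unsigned int timeout_us)
{
	ktime_t timeout = ktime_add_us(ktime_get_raw(), timeout_us);

	/* Only worth calling this if req is what the GPU is working on
	 * right now; everyone queued further back goes straight to
	 * sleep and is woken by the worker.
	 */
	while (!need_resched()) {
		if (i915_gem_request_completed(req, true))
			return true;

		if (ktime_after(ktime_get_raw(), timeout))
			break;

		cpu_relax_lowlatency();
	}

	return i915_gem_request_completed(req, false);
}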

>> (So please respin that patch series as well.)
>
> In review there was only a quick rename, but I also want to change it to
> a 10us timeout as that is favourable to more synchronous igt loads.

I thought I found a reversal of logic in "drm/i915: Only spin whilst 
waiting on the current request" which needs fixing?

>> But if that brings an improvement would a short spin be beneficial
>> here as well? Less so because waiters are already sleeping but could
>> a bit I suppose.
>
> Indeed, much less of an improvement due to the context switches.
>
>>> +void intel_engine_add_wakeup(struct drm_i915_gem_request *request)
>>> +{
>>> +	struct intel_engine_cs *engine = i915_gem_request_get_ring(request);
>>> +
>>> +	spin_lock(&engine->irq_lock);
>>> +	if (request->irq_count++ == 0) {
>>> +		struct rb_node **p, *parent;
>>> +		bool first;
>>> +
>>> +		if (RB_EMPTY_ROOT(&engine->irq_requests))
>>> +			schedule_work(&engine->irq_work);
>>
>> Worth thinking about a dedicated, maybe even CPU local work queue
>> for maximum responsiveness?
>
> Considered a kthread per ring. Downsides are that it consumes a slot for
> mostly idle threads, which are what the kworkers are for. (I am not keen
> on the complaints we would get for 6 kirq-i915/* threads.)

Also, I like the worker approach since, when there is no one waiting, 
it does not get woken up from notify_ring().

So I was just thinking, though not sure, whether using the system wq 
makes it compete with other jobs. If so, a dedicated workqueue wouldn't 
harm, and it could be made high priority as well. Not sure it would 
make sense to make it CPU-bound any more, since the period from sleep 
to wakeup is probably too long for locality effects to matter.
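
Something like this is what I mean, as a hypothetical sketch at engine 
or driver init (irq_wq is a made-up field name):

	/* Dedicated high-priority workqueue so the wakeup worker does
	 * not have to compete with unrelated system_wq work. Unbound,
	 * on the assumption that locality does not matter here.
	 */
	engine->irq_wq = alloc_workqueue("i915-wake/%s",
					 WQ_HIGHPRI | WQ_UNBOUND, 0,
					 engine->name);
	if (!engine->irq_wq)
		return -ENOMEM;

intel_engine_add_wakeup() would then do 
queue_work(engine->irq_wq, &engine->irq_work) instead of 
schedule_work().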

> What I have considered is doing intel_engine_add_wakeup() before the
> spin_request. If we have a spare CPU, that CPU will be enabling the IRQ
> whilst we do the busy spin. The complication is that doing the wakeup,
> schedule and rb insert isn't particularly cheap.

Yes, I thought about it as well, but best to leave those experiments 
for later.

Regards,

Tvrtko

P.S. And I just realised this work is competing with the scheduler, 
which changes all this again.



