[Intel-gfx] [RFC] drm/i915/bdw+: Do not emit user interrupts when not needed
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Dec 18 05:51:38 PST 2015
On 18/12/15 12:28, Chris Wilson wrote:
> On Fri, Dec 18, 2015 at 11:59:41AM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> We can rely on the context complete interrupt to wake up the waiters,
>> except in the case where requests are merged into a single ELSP
>> submission. In that case we inject MI_USER_INTERRUPTs into the
>> ring buffer to ensure prompt wake-ups.
>>
>> In for example the GLBenchmark Egypt off-screen test, this
>> optimization decreases the number of generated interrupts per
>> second by a factor of two, and the number of context switches
>> by a factor of five to six.
>
> I half like it. Are the interrupts a limiting factor in this case though?
> This should be ~100 waits/second with ~1000 batches/second, right? What
> is the delay between request completion and client wakeup - difficult to
> measure after you remove the user interrupt though! But I estimate it
> should be on the order of just a few GPU cycles.
Neither of the two benchmarks I ran (trex onscreen and egypt offscreen)
shows any framerate improvement.
The only thing I did manage to measure is that CPU energy usage goes
down with the optimisation. Roughly 8-10%, courtesy of the RAPL script
someone posted here.
Benchmarking is generally very hard, so it is a pity we don't have a
farm similar to CI which does it all in a repeatable and solid manner.
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 27f06198a51e..d9be878dbde7 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -359,6 +359,13 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
>> spin_unlock(&dev_priv->uncore.lock);
>> }
>>
>> +static void execlists_emit_user_interrupt(struct drm_i915_gem_request *req)
>> +{
>> + struct intel_ringbuffer *ringbuf = req->ringbuf;
>> +
>> + iowrite32(MI_USER_INTERRUPT, ringbuf->virtual_start + req->tail - 8);
>> +}
>> +
>> static int execlists_update_context(struct drm_i915_gem_request *rq)
>> {
>> struct intel_engine_cs *ring = rq->ring;
>> @@ -433,6 +440,12 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>> cursor->elsp_submitted = req0->elsp_submitted;
>> list_move_tail(&req0->execlist_link,
>> &ring->execlist_retired_req_list);
>> + /*
>> + * When merging requests make sure there is still
>> + * something after each batch buffer to wake up waiters.
>> + */
>> + if (cursor != req0)
>> + execlists_emit_user_interrupt(req0);
>
> You may have already missed this instruction as you patch it, and keep
> doing so as long as the context is resubmitted. I think to be safe, you
> need to patch cursor as well. You could then MI_NOOP out the
> MI_USER_INTERRUPT on the terminal request.
I don't at the moment see how it could miss it? We don't do preemption,
but granted I don't understand this code fully.
But patching it out definitely looks safer. And I don't even have to
unbreak GuC in that case. So I'll try that approach.
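Roughly what I have in mind, as an untested sketch (the helper name is
made up, and it assumes the user interrupt sits at the same fixed
offset before the tail as in the hunk above):

static void execlists_cancel_user_interrupt(struct drm_i915_gem_request *req)
{
        struct intel_ringbuffer *ringbuf = req->ringbuf;

        /*
         * Replace the MI_USER_INTERRUPT already emitted for this request
         * with a no-op. Intended for the terminal request of an ELSP
         * submission, whose completion is signalled by the context
         * complete interrupt anyway.
         */
        iowrite32(MI_NOOP, ringbuf->virtual_start + req->tail - 8);
}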
> An interesting igt experiment I think would be:
>
> thread A, keep queuing batches with just a single MI_STORE_DWORD_IMM *addr
> thread B, waits on batch from A, reads *addr (asynchronously), measures
> latency (actual value - expected(batch))
>
> Run for 10s, report min/max/median latency.
>
> Repeat for more threads/contexts and more waiters. Ah, that may be the
> demonstration for the thundering herd I've been looking for!
Hm I'll think about it.
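Something along the lines of the rough sketch below, perhaps. The
submit_store_dword_batch()/wait_for_batch() helpers and the result
pointer are made-up placeholders for the real execbuf and wait
plumbing; the idea is just to see how many batches the GPU has already
moved past by the time the waiter wakes:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helpers: submit_store_dword_batch(i) queues a batch doing
 * a single MI_STORE_DWORD_IMM of the value i to *result and returns a
 * handle, wait_for_batch() blocks until that batch is signalled.
 */
extern uint32_t submit_store_dword_batch(uint32_t value);
extern void wait_for_batch(uint32_t handle);
extern volatile uint32_t *result;

static atomic_uint last_value;
static atomic_uint last_handle;
static atomic_bool done;

/* Thread A: keep queueing batches, each storing its own index. */
static void *producer(void *arg)
{
        uint32_t i = 0;

        (void)arg;
        while (!atomic_load(&done)) {
                atomic_store(&last_handle, submit_store_dword_batch(i));
                atomic_store(&last_value, i);
                i++;
        }
        return NULL;
}

/* Thread B: wait on the most recent batch and see how far behind we are. */
static void *waiter(void *arg)
{
        uint32_t expected, lag, min = ~0u, max = 0;
        unsigned int n;

        (void)arg;
        for (n = 0; n < 10000; n++) {
                expected = atomic_load(&last_value);
                wait_for_batch(atomic_load(&last_handle));
                /* Batches already completed beyond the one we waited for. */
                lag = *result - expected;
                if (lag < min)
                        min = lag;
                if (lag > max)
                        max = lag;
        }
        printf("wakeup lag in batches: min %u max %u\n", min, max);
        atomic_store(&done, true);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, producer, NULL);
        pthread_create(&b, NULL, waiter, NULL);
        pthread_join(b, NULL);
        pthread_join(a, NULL);
        return 0;
}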
Wrt your second reply, that is an interesting question.
All I can tell is that empirically it looks like the interrupts do
arrive split, otherwise there would be no reduction in interrupt
numbers. But why they are split I don't know.
I'll try adding some counters to get a feel for how often that happens
in various scenarios.
Regards,
Tvrtko