[Intel-gfx] [PATCH v2] drm/i915: Execlist irq handler micro optimisations

Fri Feb 12 15:54:27 UTC 2016

On 12/02/16 14:42, Chris Wilson wrote:
> On Fri, Feb 12, 2016 at 12:00:40PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> Assorted changes most likely without any practical effect
>> apart from a tiny reduction in generated code for the interrupt
>> handler and request submission.
>>
>>   * Remove needless initialization.
>>   * Improve cache locality by reorganizing code and/or using
>>     branch hints to keep unexpected or error conditions out
>>     of line.
>>   * Favor busy submit path vs. empty queue.
>>   * Less branching in hot-paths.
>>
>> v2:
>>
>>   * Avoid mmio reads when possible. (Chris Wilson)
>>   * Use natural integer size for csb indices.
>>   * Remove useless return value from execlists_update_context.
>>   * Extract 32-bit ppgtt PDPs update so it is out of line and
>>     shared with two callers.
>>   * Grab forcewake across all mmio operations to ease the
>>     load on uncore lock and use cheaper mmio ops.
>>
>> Version 2 now makes the irq handling code path ~20% smaller on
>> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
>> paths are mostly in-line now and hammering on the uncore
>> spinlock is greatly reduced together with mmio traffic to an
>> extent.
>
> Did you notice that ring->next_context_status_buffer is redundant as we
> also have that information to hand in status_pointer?

I didn't notice, and I don't know that part very well. There might also
be some future-proofing issues around relying on it.
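
For reference, a minimal sketch of how the read and write pointers could
be derived from the status pointer value rather than from a cached
software copy. The bit positions, the entry count and all names below are
assumptions made for illustration, not taken from the driver:

```c
#include <stdint.h>

/*
 * Assumed layout, for illustration only: write pointer in bits 2:0,
 * read pointer in bits 10:8, and a six entry status buffer.
 */
#define SKETCH_CSB_ENTRIES	6
#define SKETCH_CSB_PTR_MASK	0x07u

static inline unsigned int sketch_csb_write_ptr(uint32_t status_pointer)
{
	return status_pointer & SKETCH_CSB_PTR_MASK;
}

static inline unsigned int sketch_csb_read_ptr(uint32_t status_pointer)
{
	return (status_pointer >> 8) & SKETCH_CSB_PTR_MASK;
}

/* Number of unread entries, allowing for write pointer wrap-around. */
static inline unsigned int sketch_csb_pending(uint32_t status_pointer)
{
	unsigned int rd = sketch_csb_read_ptr(status_pointer);
	unsigned int wr = sketch_csb_write_ptr(status_pointer);

	if (rd > wr)
		wr += SKETCH_CSB_ENTRIES;

	return wr - rd;
}
```

Whether the pointer read back this way can be relied upon in all cases is
exactly the part I am not sure about.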

> What's your thinking for
>
> 	if (req->elsp_submitted & ring->gen8_9)
>
> vs a plain
>
> 	if (req->elsp_submitted)
> ?

This is another part I don't know that well. Isn't it useful to avoid
submitting the two noops when they are not needed? Or do they still end
up submitted to the GPU somehow?
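
To make the difference between the two conditions concrete, here is a
small standalone sketch. The struct and function names are simplified
stand-ins, and it assumes gen8_9 is a 0/1 flag, which is an assumption of
this example rather than something stated above:

```c
#include <stdbool.h>

/* Simplified stand-ins; only the two fields under discussion. */
struct sketch_ring {
	unsigned int gen8_9;		/* assumed: 1 where the noops apply, else 0 */
};

struct sketch_req {
	unsigned int elsp_submitted;	/* submission count */
};

/* Single bitwise test: true only if a bit is set in both values. */
static inline bool sketch_test_masked(const struct sketch_req *req,
				      const struct sketch_ring *ring)
{
	return (req->elsp_submitted & ring->gen8_9) != 0;
}

/* Plain test: true for any non-zero count, regardless of platform. */
static inline bool sketch_test_plain(const struct sketch_req *req)
{
	return req->elsp_submitted != 0;
}
```

With gen8_9 == 0 the masked test is never true, while the plain one is
true whenever the count is non-zero. Note also that with a 0/1 flag the
masked form only looks at bit 0 of the count, so the two differ for a
count of 2 as well.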

> The tidies look good. Be useful to double check whether gem_latency is
> behaving as a canary, it's a bit of a puzzle why that first dispatch
> latency would grow.

Yes, it is a puzzle; I have no idea how or why. But "gem_latency -n 100"
does not show this regression. I did a hundred runs and these are the
results:

  * Throughput up 4.04%
  * Dispatch latency down 0.37%
  * Consumer and producer latencies down 22.53%
  * CPU time down 2.25%

So it all looks good.
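
One way such percentages can be computed, assuming a simple mean over the
runs before and after the patch. This is only a sketch of the arithmetic;
the helper names are hypothetical and it is not the script used for the
numbers above:

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0 for an empty set. */
static inline double sketch_mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	if (!n)
		return 0.0;

	for (i = 0; i < n; i++)
		sum += v[i];

	return sum / n;
}

/*
 * Relative change in percent; negative means "after" is lower.
 * The caller must make sure "before" is not zero.
 */
static inline double sketch_percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

A positive result is an improvement for throughput, and a negative one is
an improvement for the latency and CPU time figures.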

Regards,

Tvrtko