[Intel-gfx] [PATCH v2] drm/i915: Execlist irq handler micro optimisations

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Feb 12 13:46:35 UTC 2016

On 12/02/16 12:00, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>
> Assorted changes, most likely without any practical effect
> apart from a tiny reduction in generated code for the interrupt
> handler and request submission.
>
>   * Remove needless initialization.
>   * Improve cache locality by reorganizing code and/or using
>     branch hints to keep unexpected or error conditions out
>     of line.
>   * Favor the busy submit path over the empty queue.
>   * Less branching in hot paths.
>
> v2:
>
>   * Avoid mmio reads when possible. (Chris Wilson)
>   * Use natural integer size for csb indices.
>   * Remove useless return value from execlists_update_context.
>   * Extract 32-bit ppgtt PDPs update so it is out of line and
>     shared with two callers.
>   * Grab forcewake across all mmio operations to ease the
>     load on the uncore lock and use cheaper mmio ops.
>
> Version 2 now makes the irq handling code path ~20% smaller on
> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
> paths are mostly in-line now and hammering on the uncore
> spinlock is greatly reduced together with mmio traffic to an
> extent.

Is gem_latency an interesting benchmark for this?

Five runs on vanilla:

747693/1:   9.080us   2.000us   2.000us 121.840us
742108/1:   9.060us   2.520us   2.520us 122.645us
744097/1:   9.060us   2.000us   2.000us 122.372us
744056/1:   9.180us   1.980us   1.980us 122.394us
742610/1:   9.040us   2.560us   2.560us 122.525us

Five runs with this patch series:

786532/1:  10.760us   1.520us   1.520us 115.705us
780735/1:  10.740us   1.580us   1.580us 116.558us
783706/1:  10.800us   1.460us   1.460us 116.280us
784135/1:  10.800us   1.520us   1.520us 116.151us
784037/1:  10.740us   1.520us   1.520us 116.250us

So it looks like everything got better apart from dispatch latency.

5% more throughput, 30% better consumer and producer latencies, 5% less 
CPU usage, but 18% worse dispatch latency.
