[Intel-gfx] [PATCH v2] drm/i915: Execlist irq handler micro optimisations
tvrtko.ursulin at linux.intel.com
Fri Feb 12 13:46:35 UTC 2016
On 12/02/16 12:00, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Assorted changes most likely without any practical effect
> apart from a tiny reduction in generated code for the interrupt
> handler and request submission.
> * Remove needless initialization.
> * Improve cache locality by reorganizing code and/or using
> branch hints to keep unexpected or error conditions out
> of line.
> * Favor busy submit path vs. empty queue.
> * Less branching in hot-paths.
> * Avoid mmio reads when possible. (Chris Wilson)
> * Use natural integer size for csb indices.
> * Remove useless return value from execlists_update_context.
> * Extract 32-bit ppgtt PDPs update so it is out of line and
> shared with two callers.
> * Grab forcewake across all mmio operations to ease the
> load on the uncore lock and use cheaper mmio ops.
> Version 2 now makes the irq handling code path ~20% smaller on
> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
> paths are mostly in-line now and hammering on the uncore
> spinlock is greatly reduced together with mmio traffic to an
Is gem_latency an interesting benchmark for this?
Five runs on vanilla:
747693/1: 9.080us 2.000us 2.000us 121.840us
742108/1: 9.060us 2.520us 2.520us 122.645us
744097/1: 9.060us 2.000us 2.000us 122.372us
744056/1: 9.180us 1.980us 1.980us 122.394us
742610/1: 9.040us 2.560us 2.560us 122.525us
Five runs with this patch series:
786532/1: 10.760us 1.520us 1.520us 115.705us
780735/1: 10.740us 1.580us 1.580us 116.558us
783706/1: 10.800us 1.460us 1.460us 116.280us
784135/1: 10.800us 1.520us 1.520us 116.151us
784037/1: 10.740us 1.520us 1.520us 116.250us
So it looks like everything got better apart from dispatch latency.
5% more throughput, 30% better consumer and producer latencies, 5% less
CPU usage, but 18% worse dispatch latency.