[Intel-gfx] [PATCH v2] drm/i915: Execlist irq handler micro optimisations

Fri Feb 12 14:30:43 UTC 2016

On Fri, Feb 12, 2016 at 01:46:35PM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 12/02/16 12:00, Tvrtko Ursulin wrote:
> >From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >
> >Assorted changes most likely without any practical effect
> >apart from a tiny reduction in generated code for the interrupt
> >handler and request submission.
> >
> >  * Remove needless initialization.
> >  * Improve cache locality by reorganizing code and/or using
> >    branch hints to keep unexpected or error conditions out
> >    of line.
> >  * Favor busy submit path vs. empty queue.
> >  * Less branching in hot-paths.
> >
> >v2:
> >
> >  * Avoid mmio reads when possible. (Chris Wilson)
> >  * Use natural integer size for csb indices.
> >  * Remove useless return value from execlists_update_context.
> >  * Extract 32-bit ppgtt PDPs update so it is out of line and
> >    shared with two callers.
> >  * Grab forcewake across all mmio operations to ease the
> >    load on uncore lock and use cheaper mmio ops.
> >
> >Version 2 now makes the irq handling code path ~20% smaller on
> >48-bit PPGTT hardware, and a little bit less elsewhere. Hot
> >paths are mostly in-line now and hammering on the uncore
> >spinlock is greatly reduced together with mmio traffic to an
> >extent.
> 
> Is gem_latency an interesting benchmark for th

Yes, we should be able to detect changes on the order of 100ns and if
the results are stable and above the noise, then definitely.
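
To make "above the noise" concrete, here is a minimal sketch of one way to
compare two sets of latency samples. It is not part of gem_latency or the
patch. The function names and the decision rule (difference of the means
larger than the larger of the two standard deviations) are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n latency samples, in nanoseconds. */
static double mean_ns(const double *s, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += s[i];

	return sum / (double)n;
}

/* Sample variance (n - 1 denominator), in nanoseconds squared. */
static double variance_ns2(const double *s, size_t n)
{
	double m, acc = 0.0;

	assert(n > 1);
	m = mean_ns(s, n);
	for (size_t i = 0; i < n; i++) {
		double d = s[i] - m;

		acc += d * d;
	}

	return acc / (double)(n - 1);
}

/*
 * Hypothetical rule: the change counts as "above the noise" when the
 * difference of the means exceeds the larger of the two standard
 * deviations. Squared values are compared so no sqrt() is needed.
 */
static int above_noise(const double *before, size_t nb,
		       const double *after, size_t na)
{
	double delta = mean_ns(before, nb) - mean_ns(after, na);
	double vb = variance_ns2(before, nb);
	double va = variance_ns2(after, na);
	double noise2 = vb > va ? vb : va;

	return delta * delta > noise2;
}
```

With samples around 1000ns before and 900ns after, each spread by about
10ns, above_noise() returns 1. With the same means but a spread of a couple
of hundred ns it returns 0, in which case the 100ns change could not be
claimed from those runs.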

"./gem_latency" sends one batch and then waits on it, so I only expect the
patch to directly affect the dispatch latency. I'd say any change in the
wake up latency is solely due to reducing the spinlock contention.

"./gem_latency -n 100" would queue 100 no-op batches before the
measurement. That may help to look at the overhead of handling the
context-switch interrupt.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre