[Intel-gfx] [PATCH v2] drm/i915: Execlist irq handler micro optimisations
tvrtko.ursulin at linux.intel.com
Fri Feb 12 15:54:27 UTC 2016
On 12/02/16 14:42, Chris Wilson wrote:
> On Fri, Feb 12, 2016 at 12:00:40PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Assorted changes most likely without any practical effect
>> apart from a tiny reduction in generated code for the interrupt
>> handler and request submission.
>> * Remove needless initialization.
>> * Improve cache locality by reorganizing code and/or using
>> branch hints to keep unexpected or error conditions out
>> of line.
>> * Favor busy submit path vs. empty queue.
>> * Less branching in hot-paths.
>> * Avoid mmio reads when possible. (Chris Wilson)
>> * Use natural integer size for csb indices.
>> * Remove useless return value from execlists_update_context.
>> * Extract 32-bit ppgtt PDPs update so it is out of line and
>> shared with two callers.
>> * Grab forcewake across all mmio operations to ease the
>> load on uncore lock and use cheaper mmio ops.
>> Version 2 now makes the irq handling code path ~20% smaller on
>> 48-bit PPGTT hardware, and a little bit less elsewhere. Hot
>> paths are mostly in-line now and hammering on the uncore
>> spinlock is greatly reduced together with mmio traffic to an
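For readers unfamiliar with the branch-hint item in the list above, here is a minimal sketch of the general technique, not the code from the patch. It assumes GCC or Clang (for `__builtin_expect` and the `cold`/`noinline` attributes); `process_events`, `handle_bad_event` and `EVENT_ERROR` are hypothetical names made up for illustration and do not exist in i915.

```c
#include <stdint.h>

/* Branch hints in the style of the kernel's likely()/unlikely() macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical error bit, for illustration only. */
#define EVENT_ERROR 0x80000000u

/*
 * Hypothetical cold error path. noinline + cold ask the compiler to keep
 * this code away from the hot loop, so it does not occupy the same
 * instruction cache lines.
 */
__attribute__((noinline, cold))
static int handle_bad_event(uint32_t status)
{
	(void)status;
	return -1;
}

/* Returns the number of events processed, or -1 on the first error event. */
static int process_events(const uint32_t *events, unsigned int count)
{
	unsigned int i;	/* natural integer size for the index */
	int handled = 0;

	for (i = 0; i < count; i++) {
		/* The error case is expected to be rare, so hint it as such. */
		if (unlikely(events[i] & EVENT_ERROR))
			return handle_bad_event(events[i]);
		handled++;
	}

	return handled;
}
```

The hints do not change behaviour; they only influence code layout, which is why any gain shows up as smaller or better-packed hot paths rather than as different results.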
> Did you notice that ring->next_context_status_buffer is redundant as we
> also have that information to hand in status_pointer?
I didn't, and I don't know that part that well. There might be some
future-proofing issues around it as well.
> What's your thinking for
> if (req->elsp_submitted & ring->gen8_9)
> vs a plain
> if (req->elsp_submitted)
Again, I don't know this part that well. Is it not useful to not submit
two noops if they are not needed? Do they still end up submitted to the
> The tidies look good. Be useful to double check whether gem_latency is
> behaving as a canary, it's a bit of a puzzle why that first dispatch
> latency would grow.
Yes, a puzzle; no idea how or why. But "gem_latency -n 100" does not
show this regression. I've done a hundred runs and these are the results:
* Throughput up 4.04%
* Dispatch latency down 0.37%
* Consumer and producer latencies down 22.53%
* CPU time down 2.25%
So it all looks good.