[Intel-gfx] [PATCH v4] drm/i915: Execlists small cleanups and micro-optimisations

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Feb 29 10:45:34 UTC 2016



On 26/02/16 20:24, Chris Wilson wrote:
> On Fri, Feb 26, 2016 at 04:58:32PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> Assorted changes covering code cleanup, removal of an
>> invariant conditional from the interrupt handler, and
>> reduction of lock contention and MMIO access cost.
>>
>>   * Remove needless initialization.
>>   * Improve cache locality by reorganizing code and/or using
>>     branch hints to keep unexpected or error conditions out
>>     of line (see the sketch after this list).
>>   * Favor the busy submit path over the empty queue.
>>   * Fewer branches in hot paths.
>>
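>> By way of illustration, a minimal sketch of the branch-hint
>> pattern (the structs and function here are hypothetical, not
>> an actual hunk from this patch): unlikely() is the kernel's
>> __builtin_expect(!!(x), 0) wrapper, which steers the compiler
>> to lay the error path out of line so the expected path stays
>> densely packed in the instruction cache.
>>
>>   struct sketch_request { struct list_head link; };
>>   struct sketch_engine  { struct list_head queue; };
>>
>>   static int queue_request(struct sketch_engine *engine,
>>                            struct sketch_request *rq)
>>   {
>>       if (unlikely(!rq))
>>           return -EINVAL; /* error path kept out of line */
>>
>>       /* Expected path: falls straight through. */
>>       list_add_tail(&rq->link, &engine->queue);
>>       return 0;
>>   }
>>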
>> v2:
>>
>>   * Avoid mmio reads when possible. (Chris Wilson)
>>   * Use natural integer size for csb indices.
>>   * Remove useless return value from execlists_update_context.
>>   * Extract 32-bit ppgtt PDPs update so it is out of line and
>>     shared with two callers.
>>   * Grab forcewake across all mmio operations to ease the
>>     load on the uncore lock and use cheaper mmio ops (see
>>     the sketch after this list).
>>
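>> To illustrate the forcewake pattern, a simplified sketch and
>> not the actual patch hunk (the uncore helpers and accessors
>> match the driver of this era, the function body is
>> hypothetical): take a single forcewake reference around a
>> burst of register writes, then use the cheaper *_FW accessors
>> which assume forcewake is already held and so skip the
>> per-access locking and reference bookkeeping.
>>
>>   static void elsp_write(struct intel_engine_cs *ring, u64 desc[2])
>>   {
>>       struct drm_i915_private *dev_priv = ring->dev->dev_private;
>>
>>       intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>
>>       /* Port 1 first; the final write to port 0 submits. */
>>       I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[1]));
>>       I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[1]));
>>       I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[0]));
>>       I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[0]));
>>
>>       intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>   }
>>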
>> v3:
>>
>>   * Removed some more pointless u8 data types.
>>   * Removed unused return from execlists_context_queue.
>>   * Commit message updates.
>>
>> v4:
>>   * Unclumsify the unqueue if statement. (Chris Wilson)
>>   * Hide forcewake from the queuing function, as sketched
>>     below. (Chris Wilson)
>>
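>> The rough shape of that change (simplified from the real
>> code; the submit helper and the singular-queue check are
>> hypothetical stand-ins): the queuing function does list
>> bookkeeping only, while the submit helper owns the forcewake
>> and MMIO details internally.
>>
>>   static void execlists_submit(struct intel_engine_cs *ring)
>>   {
>>       struct drm_i915_private *dev_priv = ring->dev->dev_private;
>>
>>       intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>       /* ... ELSP writes via I915_WRITE_FW, as sketched above ... */
>>       intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>   }
>>
>>   static void execlists_context_queue(struct drm_i915_gem_request *req)
>>   {
>>       struct intel_engine_cs *ring = req->ring;
>>
>>       /* Pure bookkeeping; no forcewake knowledge needed here. */
>>       list_add_tail(&req->execlist_link, &ring->execlist_queue);
>>       if (list_is_singular(&ring->execlist_queue))
>>           execlists_submit(ring);
>>   }
>>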
>> Version 3 makes the irq handling code path ~20% smaller on
>> 48-bit PPGTT hardware, and somewhat smaller elsewhere. Hot
>> paths are now mostly inline, hammering on the uncore
>> spinlock is greatly reduced, and MMIO traffic is trimmed
>> to an extent.
>>
>> Benchmarking with "gem_latency -n 100" (continuously
>> submitting batches of 100 nop instructions) shows
>> approximately 4% higher throughput, 2% less CPU time and 22%
>> smaller latencies. This was measured on a big core; small
>> cores could benefit even more.
>
> Just add a quick comment about "gem_latency -n 0" suggesting an oddity
> with synchronous workloads that bears further study (just so that we
> have the hint/reminder about the test case to run).
>

This ok?

"""
One unexplained result is with "gem_latency -n 0" (dispatching
empty batches) which shows 5% more throughput, 8% less CPU time,
25% better producer and consumer latencies, but 15% higher
dispatch latency which looks like a possible measuring artifact.
"""

>> The most likely reasons for the improvements are the MMIO
>> optimisation and the reduced uncore lock traffic.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

Thanks!

Regards,

Tvrtko


