[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

Tue May 29 15:58:02 UTC 2018

Hi,

On 25.05.2018 00:55, Jason Ekstrand wrote:
> This patch series adds back-end compiler support for SIMD32 fragment
> shaders.  Support is added and everything works but it's currently hidden
> behind INTEL_DEBUG=do32.  We know that it improves performance in some
> cases but we do not yet have a good enough heuristic to start turning it on
> by default.  The objective of this series is just to get the compiler
> infrastructure landed so that it stops bit-rotting in Curro's branch.

Tested v3 on BXT & SKL.  Everything seems to work otherwise fine.

Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>

> Figuring out a good heuristic is left as an exercise to the reader. :-)

A simple heuristic that just enables SIMD32 for every shader that isn't
an MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
  VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.

SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why it is good to disable SIMD32 with MRT shaders).
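
The simple heuristic described above (SIMD32 for everything that isn't
an MRT shader) could be sketched roughly as follows.  This is only an
illustration: struct fs_info and want_simd32() are made-up names, not
anything from the series or from Mesa, and the shader is assumed to be
summarized just by the number of render targets it writes.

```c
#include <stdbool.h>

/* Hypothetical summary of a fragment shader; not a Mesa structure. */
struct fs_info {
   unsigned num_color_outputs;  /* render targets the shader writes */
};

/* Enable SIMD32 unless the shader writes multiple render targets (MRT). */
static bool
want_simd32(const struct fs_info *info)
{
   return info->num_color_outputs <= 1;
}
```

With this check a depth-only or single-RT shader gets SIMD32, while
anything writing two or more render targets stays at the narrower
dispatch widths.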

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the above GfxBench tests, with the amount of texture sampling they
do, this shows in HW counters as increased texture cache misses
(thrashing), but fewer L3 misses (better locality).  Along with (more
important) better latency compensation, these explain why SIMD32
improves performance in them.

More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction
scheduler that provides feedback on which shaders have latency issues
where SIMD32 would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for the draw.)
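
A rough sketch of what such a run-time check might look like is below.
Everything in it is an assumption made for illustration: the function
name, the idea of checking each bound texture's size individually, and
especially the 16 MiB cut-off, which is a placeholder and not a
measured value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder cut-off; no threshold has been measured or proposed. */
#define MAX_SIMD32_TEXTURE_BYTES (16u * 1024u * 1024u)

/*
 * Hypothetical per-draw check: returns false if any texture bound for
 * the draw is larger than the cut-off, true otherwise.
 */
static bool
draw_allows_simd32(const uint64_t *bound_texture_bytes, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      if (bound_texture_bytes[i] > MAX_SIMD32_TEXTURE_BYTES)
         return false;
   }
   return true;
}
```

A driver would then pick the SIMD32 kernel only when both the
compile-time heuristic and this per-draw check agree; a real limit
would have to come from measurements on each platform.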

	- Eero

> Francisco Jerez (34):
>    intel/eu: Remove brw_codegen::compressed_stack.
>    intel/fs: Rename a local variable so it doesn't shadow component()
>    intel/fs: Use the ATTR file for FS inputs
>    intel/fs: Replace the CINTERP opcode with a simple MOV
>    intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
>    intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT
>      writes.
>    intel/eu: Return new instruction to caller from brw_fb_WRITE().
>    intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
>    intel/fs: Fix implied_mrf_writes() for headerless FB writes.
>    intel/fs: Remove program key argument from generator.
>    intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
>    intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
>    intel/eu: Fix pixel interpolator queries for SIMD32.
>    intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
>    intel/fs: Don't enable dual source blend if no outputs are written
>    intel/fs: Fix FB write message control codegen for SIMD32.
>    intel/fs: Fix logical FB write lowering for SIMD32
>    intel/fs: Fix FB read header setup for SIMD32.
>    intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
>    intel/fs: Mark LINTERP opcode as writing accumulator implicitly on
>      pre-Gen7.
>    intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
>    i965: Add plumbing for shader time in 32-wide FS dispatch mode.
>    intel/fs: Simplify fs_visitor::emit_samplepos_setup
>    intel/fs: Use fs_regs instead of brw_regs in the unlit centroid
>      workaround
>    intel/fs: Wrap FS payload register look-up in a helper function.
>    intel/fs: Extend thread payload layout to SIMD32
>    intel/fs: Implement 32-wide FS payload setup on Gen6+
>    intel/fs: Fix Gen7 compressed source region alignment restriction for
>      SIMD32
>    intel/fs: Fix sample id setup for SIMD32.
>    intel/fs: Generalize the unlit centroid workaround
>    intel/fs: Fix Gen6+ interpolation setup for SIMD32
>    intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
>    intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
>    intel/fs: Build 32-wide FS shaders.
> 
> Jason Ekstrand (19):
>    intel/fs: Assert that the gen4-6 plane restrictions are followed
>    intel/fs: Use groups for SIMD16 LINTERP on gen11+
>    intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
>    intel/fs: Properly track implied header regs read by FB writes
>    intel/fs: Pull FB write implied headers from src[0]
>    intel/fs: Set up FB write message headers in the visitor
>    i965: Re-arrange shader kernel setup in WM state
>    intel/compiler: Add and use helpers for working with KSP indices
>    intel/fs: Rework KSP data to be SIMD width-based
>    intel/fs: Split instructions low to high in lower_simd_width
>    intel/fs: Properly copy default flag reg for 3src instrucitons
>    intel/fs: Add the group to the flag subreg number on SNB and older
>    intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
>    intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
>    intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
>    intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
>    intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
>    intel/fs: Remove support push constants in repclear shaders
>    intel/fs: Support SIMD32 repclear shaders
> 
>   src/intel/blorp/blorp.c                       |   2 +-
>   src/intel/blorp/blorp_genX_exec.h             |  82 +++-
>   src/intel/compiler/brw_compiler.h             |  98 +++-
>   src/intel/compiler/brw_eu.h                   |  21 +-
>   src/intel/compiler/brw_eu_defines.h           |   2 -
>   src/intel/compiler/brw_eu_emit.c              |  39 +-
>   src/intel/compiler/brw_fs.cpp                 | 666 ++++++++++++++++----------
>   src/intel/compiler/brw_fs.h                   |  53 +-
>   src/intel/compiler/brw_fs_builder.h           |   6 +-
>   src/intel/compiler/brw_fs_cse.cpp             |   1 -
>   src/intel/compiler/brw_fs_generator.cpp       | 318 ++++++------
>   src/intel/compiler/brw_fs_nir.cpp             |  57 ++-
>   src/intel/compiler/brw_fs_visitor.cpp         | 193 ++++----
>   src/intel/compiler/brw_ir_fs.h                |   1 +
>   src/intel/compiler/brw_shader.cpp             |  12 +-
>   src/intel/compiler/brw_vec4.cpp               |   2 +-
>   src/intel/compiler/brw_vec4_gs_visitor.cpp    |   2 +-
>   src/intel/compiler/brw_vec4_tcs.cpp           |   2 +-
>   src/intel/compiler/brw_wm_iz.cpp              |  11 +-
>   src/intel/vulkan/anv_pipeline.c               |   2 +-
>   src/intel/vulkan/genX_pipeline.c              |  40 +-
>   src/mesa/drivers/dri/i965/brw_context.h       |   1 +
>   src/mesa/drivers/dri/i965/brw_program.c       |   6 +
>   src/mesa/drivers/dri/i965/brw_wm.c            |   6 +-
>   src/mesa/drivers/dri/i965/gen4_blorp_exec.h   |  17 +-
>   src/mesa/drivers/dri/i965/genX_state_upload.c | 144 ++++--
>   26 files changed, 1101 insertions(+), 683 deletions(-)
>