[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

Fri Jun 1 10:53:20 UTC 2018

Hi,

On 30.05.2018 17:30, Jason Ekstrand wrote:
> On May 30, 2018 06:45:29 Eero Tamminen <eero.t.tamminen at intel.com> wrote:
>> On 29.05.2018 18:58, Eero Tamminen wrote:
>>> On 25.05.2018 00:55, Jason Ekstrand wrote:
>>>> This patch series adds back-end compiler support for SIMD32 fragment
>>>> shaders.  Support is added and everything works but it's currently
>>>> hidden behind INTEL_DEBUG=do32.  We know that it improves performance
>>>> in some cases but we do not yet have a good enough heuristic to start
>>>> turning it on by default.  The objective of this series is just to get
>>>> the compiler infrastructure landed so that it stops bit-rotting in
>>>> Curro's branch.
>>>
>>> Tested v3 on BXT & SKL.  Everything seems to work fine.
>>
>> Everything also works fine on GEN8 (BSW & BDW GT2), but half the tests
>> trigger GPU hangs on GEN7 (BYT & HSW GT2).
> 
> That problem is known.  It's caused by using SIMD32 shaders for fast 
> clears.  The SIMD32 replicated clear shaders were added on at the last 
> minute and didn't get good enough testing before sending out the series. 
> We can either drop those two patches and modify the last one to not do 
> SIMD32 when use_replicated_clear is set or I have another patch which 
> just disables SIMD32 for fast clears.

AFAIK plain copy & write shaders (like clear) don't benefit from SIMD32.
They are already 100% bottlenecked by input/output bandwidth with
SIMD16, so the instruction scheduling latency improvement can't help.  At
worst, SIMD32 can make them slower if it causes extra cache thrashing.
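
As a rough illustration of the above, a SIMD32 heuristic could reject
bandwidth-bound shaders up front, along with the MRT shaders mentioned
further down in the thread.  This is only a sketch: struct fs_summary, its
fields and want_simd32() are hypothetical names made up for this example,
not existing Mesa code.

```c
#include <stdbool.h>

/* Hypothetical per-shader summary; NOT an existing Mesa structure. */
struct fs_summary {
   unsigned num_color_outputs;   /* more than one means an MRT shader */
   bool     is_replicated_clear; /* fast clear (repclear) shader */
   bool     is_plain_copy;       /* plain copy/write, bandwidth bound */
};

/* Returns true if the shader is a candidate for SIMD32 dispatch. */
static bool
want_simd32(const struct fs_summary *fs)
{
   /* Copy/write/clear shaders are already limited by I/O bandwidth at
    * SIMD16, so wider dispatch cannot hide any more latency for them.
    */
   if (fs->is_replicated_clear || fs->is_plain_copy)
      return false;

   /* Assumption taken from the simple heuristic discussed in this
    * thread: skip MRT shaders, which are write bound.
    */
   if (fs->num_color_outputs > 1)
      return false;

   return true;
}
```

A driver would call something like want_simd32() once per fragment shader
when deciding which dispatch widths to build; anything rejected here would
stay at SIMD8/SIMD16.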

	- Eero

>> One option would be to support SIMD32 just for GEN8+.
>>
>>
>>> Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>
>>>
>>>
>>>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>>>
>>> A simple heuristic that just enables SIMD32 for everything that isn't
>>> an MRT shader gives nice perf improvements on BXT J4205:
>>> * +30% GfxBench ALU2
>>> * +25% SynMark PSPom
>>> * +10% GpuTest Julia32
>>> * +9% GfxBench CarChase
>>> * +7% GfxBench Manhattan 3.0
>>> * +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
>>> * +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
>>> * -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
>>>   VSInstancing & ZBuffer
>>> * -2-3% GLB 2.7 Fill
>>> * -4-5% MemBW Blend
>>>
>>> On SKL, perf differences are smaller.
>>
>> On GEN8, the improvements are smaller and regressions larger with
>> the same heuristic.
>>
>> The main difference with the 12EU single-channel BSW is a 15% regression
>> in perf of SynMark FillTexMulti, i.e. sampling 8 textures and writing
>> out their average value.  With single-channel memory, increased memory
>> latency causes a lot more thrashing with SIMD32 when many textures are
>> being sampled close together.
>>
>>
>>> SIMD32 can cause write-bound tests to thrash, which is visible as a perf
>>> regression in the fully write-bound tests above (that's also the reason
>>> why SIMD32 is good to disable with MRT shaders).
>>>
>>> As to reads, SIMD32 improves cache locality until it starts thrashing.
>>> In the above GfxBench tests, with the amount of texture sampling they do,
>>> this shows in HW counters as increased texture cache misses (thrashing),
>>> but fewer L3 misses (better locality).  Along with (more important) better
>>> latency compensation, these explain why SIMD32 improves performance in
>>> them.
>>>
>>>
>>> More advanced heuristics that try to avoid the SIMD32 performance
>>> regressions unfortunately also get rid of a clear part of the above
>>> improvements.  Such heuristics would need an improved instruction scheduler
>>
>> Heuristics for things affecting texture fetch latencies would help, like
>> how many fetches there are, to how many different textures and how close
>> together they are vs. how large the caches are and how fast the RAM is.
>>
>>
>> - Eero
>>
>>> that provides feedback on which shaders have latency issues where SIMD32
>>> would help.
>>>
>>> (A potential run-time heuristic would be disabling SIMD32 when too
>>> large textures are bound for a draw.)
>>>
>>>
>>>    - Eero
>>>
>>>> Francisco Jerez (34):
>>>>   intel/eu: Remove brw_codegen::compressed_stack.
>>>>   intel/fs: Rename a local variable so it doesn't shadow component()
>>>>   intel/fs: Use the ATTR file for FS inputs
>>>>   intel/fs: Replace the CINTERP opcode with a simple MOV
>>>>   intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
>>>>   intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT
>>>>     writes.
>>>>   intel/eu: Return new instruction to caller from brw_fb_WRITE().
>>>>   intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
>>>>   intel/fs: Fix implied_mrf_writes() for headerless FB writes.
>>>>   intel/fs: Remove program key argument from generator.
>>>>   intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
>>>>   intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
>>>>   intel/eu: Fix pixel interpolator queries for SIMD32.
>>>>   intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
>>>>   intel/fs: Don't enable dual source blend if no outputs are written
>>>>   intel/fs: Fix FB write message control codegen for SIMD32.
>>>>   intel/fs: Fix logical FB write lowering for SIMD32
>>>>   intel/fs: Fix FB read header setup for SIMD32.
>>>>   intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
>>>>   intel/fs: Mark LINTERP opcode as writing accumulator implicitly on
>>>>     pre-Gen7.
>>>>   intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
>>>>   i965: Add plumbing for shader time in 32-wide FS dispatch mode.
>>>>   intel/fs: Simplify fs_visitor::emit_samplepos_setup
>>>>   intel/fs: Use fs_regs instead of brw_regs in the unlit centroid
>>>>     workaround
>>>>   intel/fs: Wrap FS payload register look-up in a helper function.
>>>>   intel/fs: Extend thread payload layout to SIMD32
>>>>   intel/fs: Implement 32-wide FS payload setup on Gen6+
>>>>   intel/fs: Fix Gen7 compressed source region alignment restriction for
>>>>     SIMD32
>>>>   intel/fs: Fix sample id setup for SIMD32.
>>>>   intel/fs: Generalize the unlit centroid workaround
>>>>   intel/fs: Fix Gen6+ interpolation setup for SIMD32
>>>>   intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
>>>>   intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
>>>>   intel/fs: Build 32-wide FS shaders.
>>>>
>>>> Jason Ekstrand (19):
>>>>   intel/fs: Assert that the gen4-6 plane restrictions are followed
>>>>   intel/fs: Use groups for SIMD16 LINTERP on gen11+
>>>>   intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
>>>>   intel/fs: Properly track implied header regs read by FB writes
>>>>   intel/fs: Pull FB write implied headers from src[0]
>>>>   intel/fs: Set up FB write message headers in the visitor
>>>>   i965: Re-arrange shader kernel setup in WM state
>>>>   intel/compiler: Add and use helpers for working with KSP indices
>>>>   intel/fs: Rework KSP data to be SIMD width-based
>>>>   intel/fs: Split instructions low to high in lower_simd_width
>>>>   intel/fs: Properly copy default flag reg for 3src instrucitons
>>>>   intel/fs: Add the group to the flag subreg number on SNB and older
>>>>   intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
>>>>   intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
>>>>   intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
>>>>   intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
>>>>   intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
>>>>   intel/fs: Remove support push constants in repclear shaders
>>>>   intel/fs: Support SIMD32 repclear shaders
>>>>
>>>>  src/intel/blorp/blorp.c                       |   2 +-
>>>>  src/intel/blorp/blorp_genX_exec.h             |  82 +++-
>>>>  src/intel/compiler/brw_compiler.h             |  98 +++-
>>>>  src/intel/compiler/brw_eu.h                   |  21 +-
>>>>  src/intel/compiler/brw_eu_defines.h           |   2 -
>>>>  src/intel/compiler/brw_eu_emit.c              |  39 +-
>>>>  src/intel/compiler/brw_fs.cpp                 | 666 ++++++++++++++++----------
>>>>  src/intel/compiler/brw_fs.h                   |  53 +-
>>>>  src/intel/compiler/brw_fs_builder.h           |   6 +-
>>>>  src/intel/compiler/brw_fs_cse.cpp             |   1 -
>>>>  src/intel/compiler/brw_fs_generator.cpp       | 318 ++++++------
>>>>  src/intel/compiler/brw_fs_nir.cpp             |  57 ++-
>>>>  src/intel/compiler/brw_fs_visitor.cpp         | 193 ++++----
>>>>  src/intel/compiler/brw_ir_fs.h                |   1 +
>>>>  src/intel/compiler/brw_shader.cpp             |  12 +-
>>>>  src/intel/compiler/brw_vec4.cpp               |   2 +-
>>>>  src/intel/compiler/brw_vec4_gs_visitor.cpp    |   2 +-
>>>>  src/intel/compiler/brw_vec4_tcs.cpp           |   2 +-
>>>>  src/intel/compiler/brw_wm_iz.cpp              |  11 +-
>>>>  src/intel/vulkan/anv_pipeline.c               |   2 +-
>>>>  src/intel/vulkan/genX_pipeline.c              |  40 +-
>>>>  src/mesa/drivers/dri/i965/brw_context.h       |   1 +
>>>>  src/mesa/drivers/dri/i965/brw_program.c       |   6 +
>>>>  src/mesa/drivers/dri/i965/brw_wm.c            |   6 +-
>>>>  src/mesa/drivers/dri/i965/gen4_blorp_exec.h   |  17 +-
>>>>  src/mesa/drivers/dri/i965/genX_state_upload.c | 144 ++++--
>>>>  26 files changed, 1101 insertions(+), 683 deletions(-)
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev