[Mesa-dev] [PATCH 1/2] intel/fs: Implement GRF bank conflict mitigation pass.

Thu Dec 7 18:10:27 UTC 2017

On Thu, Jun 22, 2017 at 12:20 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Unnecessary GRF bank conflicts increase the issue time of ternary
> instructions (the overwhelmingly most common of which is MAD) by
> roughly 50%, leading to reduced ALU throughput.  This pass attempts to
> minimize the number of bank conflicts by rearranging the layout of the
> GRF space post-register allocation.  It's in general not possible to
> eliminate all of them without introducing extra copies, which are
> typically more expensive than the bank conflict itself.
>
> In a shader-db run on SKL this helps roughly 46k shaders:
>
>    total conflicts in shared programs: 1008981 -> 600461 (-40.49%)
>    conflicts in affected programs: 816222 -> 407702 (-50.05%)
>    helped: 46234
>    HURT: 72
>
> The running time of shader-db itself on SKL seems to be increased by
> roughly 2.52%±1.13% with n=20 due to the additional work done by the
> compiler back-end.
>
> On earlier generations the pass is somewhat less effective in relative
> terms because the hardware incurs a bank conflict anytime the last two
> sources of the instruction are duplicate (e.g. while trying to square
> a value using MAD), which is impossible to avoid without introducing
> copies.  E.g. for a shader-db run on SNB:
>
>    total conflicts in shared programs: 944636 -> 623185 (-34.03%)
>    conflicts in affected programs: 853258 -> 531807 (-37.67%)
>    helped: 31052
>    HURT: 19
>
> And on BDW:
>
>    total conflicts in shared programs: 1418393 -> 987539 (-30.38%)
>    conflicts in affected programs: 1179787 -> 748933 (-36.52%)
>    helped: 47592
>    HURT: 70
>
> On SKL GT4e this improves performance of GpuTest Volplosion by 3.64%
> ±0.33% with n=16.
>
> NOTE: This patch intentionally disregards some i965 coding conventions
>       for the sake of reviewability.  This is addressed by the next
>       squash patch which introduces an amount of (for the most part
>       boring) boilerplate that might distract reviewers from the
>       non-trivial algorithmic details of the pass.
> ---
>  src/intel/Makefile.sources                   |   1 +
>  src/intel/compiler/brw_fs.cpp                |   2 +
>  src/intel/compiler/brw_fs.h                  |   1 +
>  src/intel/compiler/brw_fs_bank_conflicts.cpp | 791 +++++++++++++++++++++++++++
>  4 files changed, 795 insertions(+)
>  create mode 100644 src/intel/compiler/brw_fs_bank_conflicts.cpp
>
> diff --git a/src/intel/Makefile.sources b/src/intel/Makefile.sources
> index a877ff2..1b9799a 100644
> --- a/src/intel/Makefile.sources
> +++ b/src/intel/Makefile.sources
> @@ -44,6 +44,7 @@ COMPILER_FILES = \
>         compiler/brw_eu_util.c \
>         compiler/brw_eu_validate.c \
>         compiler/brw_fs_builder.h \
> +        compiler/brw_fs_bank_conflicts.cpp \

Use tabs in Makefiles

>         compiler/brw_fs_cmod_propagation.cpp \
>         compiler/brw_fs_combine_constants.cpp \
>         compiler/brw_fs_copy_propagation.cpp \
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 43b6e34..0a85c0c 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -5858,6 +5858,8 @@ fs_visitor::allocate_registers(bool allow_spilling)
>     if (failed)
>        return;
>
> +   opt_bank_conflicts();
> +
>     schedule_instructions(SCHEDULE_POST);
>
>     if (last_scratch > 0) {
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 6c8c027..b1fc7b3 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -141,6 +141,7 @@ public:
>                                     exec_list *acp);
>     bool opt_drop_redundant_mov_to_flags();
>     bool opt_register_renaming();
> +   bool opt_bank_conflicts();
>     bool register_coalesce();
>     bool compute_to_mrf();
>     bool eliminate_find_live_channel();
> diff --git a/src/intel/compiler/brw_fs_bank_conflicts.cpp b/src/intel/compiler/brw_fs_bank_conflicts.cpp
> new file mode 100644
> index 0000000..0225c70
> --- /dev/null
> +++ b/src/intel/compiler/brw_fs_bank_conflicts.cpp
> @@ -0,0 +1,791 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + */
> +
> +/** @file brw_fs_bank_conflicts.cpp
> + *
> + * This file contains a GRF bank conflict mitigation pass.  The pass is
> + * intended to be run after register allocation and works by rearranging the
> + * layout of the GRF space (hopefully without altering the semantics of the
> + * program) in a way that minimizes the number of GRF bank conflicts incurred

I'd remove the parenthetical. The pass better not change the semantics
of the program!

> + * by ternary instructions.
> + *
> + * Unfortunately there is close to no information about bank conflicts in the
> + * hardware spec, but experimentally on Gen7-Gen9 ternary instructions seem to
> + * incur an average bank conflict penalty of one cycle per SIMD8 op whenever
> + * the second and third source are stored in the same GRF bank (\sa bank_of()
> + * for the exact bank layout) which cannot be fetched during the same cycle by
> + * the EU, unless the EU logic manages to optimize out the read cycle of a
> + * duplicate source register (\sa is_conflict_optimized_out()).
> + *
> + * The asymptotic run-time of the algorithm is dominated by the
> + * shader_conflict_weight_matrix() computation below, which is O(n) on the
> + * number of instructions in the program, however for small and medium-sized
> + * programs the run-time is likely to be dominated by
> + * optimize_reg_permutation() which is O(m^3) on the number of GRF atoms of
> + * the program (\sa partitioning), which is bounded (since the program uses a
> + * bounded number of registers post-regalloc) and of the order of 100.  For
> + * that reason optimize_reg_permutation() is vectorized in order to keep the
> + * cubic term within reasonable bounds for m close to its theoretical maximum.
> + */
> +
> +#include "brw_fs.h"
> +#include "brw_cfg.h"
> +
> +#include <vector>
> +#include <array>
> +
> +#ifdef __SSE2__
> +
> +#include <emmintrin.h>
> +
> +/**
> + * Thin layer around vector intrinsics so they can be easily replaced with
> + * e.g. the fall-back scalar path, an implementation with different vector
> + * width or using different SIMD architectures (AVX-512?!).
> + *
> + * This implementation operates on pairs of independent SSE2 integer vectors à
> + * la SIMD16 for somewhat improved throughput.  SSE2 is supported by virtually
> + * all platforms that care about bank conflicts, so this path should almost

SSE2 is available on all processors that can be paired with a graphics
core supported by i965. We build i965 with -msse2 now (this wasn't the
case when you sent the patch initially). I'm fine with leaving the
#ifdef __SSE2__ in the code if the other path might help with
debugging someday.