[Mesa-dev] [PATCH 1/2] intel/fs: Implement GRF bank conflict mitigation pass.

Tue Dec 12 13:13:36 UTC 2017

Hi,

On 11.12.2017 12:28, Eero Tamminen wrote:
> Thanks for finally having this handled in Mesa!
> 
> This patch series, live intervals and "Don't let undefined values 
> prevent copy propagation" commits help performance in following tests:
> * GfxBench CarChase (2% by live intervals)

That was on SKL GT2.

On BXT J4205, it (or the whole set of commits) improved CarChase by 4-5%!

> * GfxBench AztecRuins & Manhattan 3.0 (very marginally)
> * GfxBench Tessellation & ALU (not ALU2)
> * GpuTest Volplosion & Julia FP64 (maybe also FurMark)
> * SynMark CSDof (2-3% by copy propagation)
> * SynMark PSPom (1% by live intervals)
> 
> Most visible improvements are on (all) GEN9+ platforms, but several of 
> them are visible also on earlier GENS.
> 
> 
> Shader compilation speed (in SynMark DrvShComp) drops by ~10%, mainly 
> from the the copy propagation commit.
> 
> 
> Live intervals commit may have introduced small rendering regression in 
> DOTA2 (Vulkan version), I'll check that next.

I wasn't able to reproduce that with real games, so I assume it to be an 
issue with Vulkan trace / replay used for the render validation (trace 
replay showed a lot of barrier errors with Vulkan API validation).

-> Seems I need to update our DOTA2 (and other games) validation traces 
after updating vktrace/replay.  Hopefully new versions work better with 
latest Mesa code.

	- Eero

> On 06.12.2017 22:38, Francisco Jerez wrote:
>> This series (which is ready for production and improves the cycle count
>> of over 46k shaders) has been sitting here for nearly half a year.  I'm
>> planning to self-review it and land it (along with PATCH 3/2 I just sent
>> to make sure we keep regressions under control) if nobody else does in
>> the next two weeks.
>>
>> Francisco Jerez <currojerez at riseup.net> writes:
>>
>>> Unnecessary GRF bank conflicts increase the issue time of ternary
>>> instructions (the overwhelmingly most common of which is MAD) by
>>> roughly 50%, leading to reduced ALU throughput.  This pass attempts to
>>> minimize the number of bank conflicts by rearranging the layout of the
>>> GRF space post-register allocation.  It's in general not possible to
>>> eliminate all of them without introducing extra copies, which are
>>> typically more expensive than the bank conflict itself.
>>>
>>> In a shader-db run on SKL this helps roughly 46k shaders:
>>>
>>>     total conflicts in shared programs: 1008981 -> 600461 (-40.49%)
>>>     conflicts in affected programs: 816222 -> 407702 (-50.05%)
>>>     helped: 46234
>>>     HURT: 72
>>>
>>> The running time of shader-db itself on SKL seems to be increased by
>>> roughly 2.52%±1.13% with n=20 due to the additional work done by the
>>> compiler back-end.
>>>
>>> On earlier generations the pass is somewhat less effective in relative
>>> terms because the hardware incurs a bank conflict anytime the last two
>>> sources of the instruction are duplicate (e.g. while trying to square
>>> a value using MAD), which is impossible to avoid without introducing
>>> copies.  E.g. for a shader-db run on SNB:
>>>
>>>     total conflicts in shared programs: 944636 -> 623185 (-34.03%)
>>>     conflicts in affected programs: 853258 -> 531807 (-37.67%)
>>>     helped: 31052
>>>     HURT: 19
>>>
>>> And on BDW:
>>>
>>>     total conflicts in shared programs: 1418393 -> 987539 (-30.38%)
>>>     conflicts in affected programs: 1179787 -> 748933 (-36.52%)
>>>     helped: 47592
>>>     HURT: 70
>>>
>>> On SKL GT4e this improves performance of GpuTest Volplosion by 3.64%
>>> ±0.33% with n=16.
>>>
>>> NOTE: This patch intentionally disregards some i965 coding conventions
>>>        for the sake of reviewability.  This is addressed by the next
>>>        squash patch which introduces an amount of (for the most part
>>>        boring) boilerplate that might distract reviewers from the
>>>        non-trivial algorithmic details of the pass.
>>> ---
>>>   src/intel/Makefile.sources                   |   1 +
>>>   src/intel/compiler/brw_fs.cpp                |   2 +
>>>   src/intel/compiler/brw_fs.h                  |   1 +
>>>   src/intel/compiler/brw_fs_bank_conflicts.cpp | 791 
>>> +++++++++++++++++++++++++++
>>>   4 files changed, 795 insertions(+)
>>>   create mode 100644 src/intel/compiler/brw_fs_bank_conflicts.cpp
>>>
>>> diff --git a/src/intel/Makefile.sources b/src/intel/Makefile.sources
>>> index a877ff2..1b9799a 100644
>>> --- a/src/intel/Makefile.sources
>>> +++ b/src/intel/Makefile.sources
>>> @@ -44,6 +44,7 @@ COMPILER_FILES = \
>>>       compiler/brw_eu_util.c \
>>>       compiler/brw_eu_validate.c \
>>>       compiler/brw_fs_builder.h \
>>> +        compiler/brw_fs_bank_conflicts.cpp \
>>>       compiler/brw_fs_cmod_propagation.cpp \
>>>       compiler/brw_fs_combine_constants.cpp \
>>>       compiler/brw_fs_copy_propagation.cpp \
>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>> b/src/intel/compiler/brw_fs.cpp
>>> index 43b6e34..0a85c0c 100644
>>> --- a/src/intel/compiler/brw_fs.cpp
>>> +++ b/src/intel/compiler/brw_fs.cpp
>>> @@ -5858,6 +5858,8 @@ fs_visitor::allocate_registers(bool 
>>> allow_spilling)
>>>      if (failed)
>>>         return;
>>> +   opt_bank_conflicts();
>>> +
>>>      schedule_instructions(SCHEDULE_POST);
>>>      if (last_scratch > 0) {
>>> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
>>> index 6c8c027..b1fc7b3 100644
>>> --- a/src/intel/compiler/brw_fs.h
>>> +++ b/src/intel/compiler/brw_fs.h
>>> @@ -141,6 +141,7 @@ public:
>>>                                      exec_list *acp);
>>>      bool opt_drop_redundant_mov_to_flags();
>>>      bool opt_register_renaming();
>>> +   bool opt_bank_conflicts();
>>>      bool register_coalesce();
>>>      bool compute_to_mrf();
>>>      bool eliminate_find_live_channel();
>>> diff --git a/src/intel/compiler/brw_fs_bank_conflicts.cpp 
>>> b/src/intel/compiler/brw_fs_bank_conflicts.cpp
>>> new file mode 100644
>>> index 0000000..0225c70
>>> --- /dev/null
>>> +++ b/src/intel/compiler/brw_fs_bank_conflicts.cpp
>>> @@ -0,0 +1,791 @@
>>> +/*
>>> + * Copyright © 2017 Intel Corporation
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person 
>>> obtaining a
>>> + * copy of this software and associated documentation files (the 
>>> "Software"),
>>> + * to deal in the Software without restriction, including without 
>>> limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, 
>>> sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom 
>>> the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including 
>>> the next
>>> + * paragraph) shall be included in all copies or substantial 
>>> portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
>>> EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
>>> MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO 
>>> EVENT SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES 
>>> OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
>>> ARISING
>>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 
>>> OTHER DEALINGS
>>> + * IN THE SOFTWARE.
>>> + */
>>> +
>>> +/** @file brw_fs_bank_conflicts.cpp
>>> + *
>>> + * This file contains a GRF bank conflict mitigation pass.  The pass is
>>> + * intended to be run after register allocation and works by 
>>> rearranging the
>>> + * layout of the GRF space (hopefully without altering the semantics 
>>> of the
>>> + * program) in a way that minimizes the number of GRF bank conflicts 
>>> incurred
>>> + * by ternary instructions.
>>> + *
>>> + * Unfortunately there is close to no information about bank 
>>> conflicts in the
>>> + * hardware spec, but experimentally on Gen7-Gen9 ternary 
>>> instructions seem to
>>> + * incur an average bank conflict penalty of one cycle per SIMD8 op 
>>> whenever
>>> + * the second and third source are stored in the same GRF bank (\sa 
>>> bank_of()
>>> + * for the exact bank layout) which cannot be fetched during the 
>>> same cycle by
>>> + * the EU, unless the EU logic manages to optimize out the read 
>>> cycle of a
>>> + * duplicate source register (\sa is_conflict_optimized_out()).
>>> + *
>>> + * The asymptotic run-time of the algorithm is dominated by the
>>> + * shader_conflict_weight_matrix() computation below, which is O(n) 
>>> on the
>>> + * number of instructions in the program, however for small and 
>>> medium-sized
>>> + * programs the run-time is likely to be dominated by
>>> + * optimize_reg_permutation() which is O(m^3) on the number of GRF 
>>> atoms of
>>> + * the program (\sa partitioning), which is bounded (since the 
>>> program uses a
>>> + * bounded number of registers post-regalloc) and of the order of 
>>> 100.  For
>>> + * that reason optimize_reg_permutation() is vectorized in order to 
>>> keep the
>>> + * cubic term within reasonable bounds for m close to its 
>>> theoretical maximum.
>>> + */
>>> +
>>> +#include "brw_fs.h"
>>> +#include "brw_cfg.h"
>>> +
>>> +#include <vector>
>>> +#include <array>
>>> +
>>> +#ifdef __SSE2__
>>> +
>>> +#include <emmintrin.h>
>>> +
>>> +/**
>>> + * Thin layer around vector intrinsics so they can be easily 
>>> replaced with
>>> + * e.g. the fall-back scalar path, an implementation with different 
>>> vector
>>> + * width or using different SIMD architectures (AVX-512?!).
>>> + *
>>> + * This implementation operates on pairs of independent SSE2 integer 
>>> vectors à
>>> + * la SIMD16 for somewhat improved throughput.  SSE2 is supported by 
>>> virtually
>>> + * all platforms that care about bank conflicts, so this path should 
>>> almost
>>> + * always be available in practice.
>>> + */
>>> +namespace {
>>> +   /**
>>> +    * SIMD integer vector data type.
>>> +    */
>>> +   typedef std::array<__m128i, 2> vector_type;
>>> +
>>> +   /**
>>> +    * Scalar data type matching the representation of a single 
>>> component of \p
>>> +    * vector_type.
>>> +    */
>>> +   typedef int16_t scalar_type;
>>> +
>>> +   /**
>>> +    * Maximum integer value representable as a \p scalar_type.
>>> +    */
>>> +   const scalar_type max_scalar = INT16_MAX;
>>> +
>>> +   /**
>>> +    * Number of components of a \p vector_type.
>>> +    */
>>> +   const unsigned vector_width = 2 * sizeof(vector_type::value_type) /
>>> +                                     sizeof(scalar_type);
>>> +
>>> +   /**
>>> +    * Set the i-th component of vector \p v to \p x.
>>> +    */
>>> +   void
>>> +   set(vector_type &v, unsigned i, scalar_type x)
>>> +   {
>>> +      assert(i < vector_width);
>>> +      memcpy((char *)v.data() + i * sizeof(x), &x, sizeof(x));
>>> +   }
>>> +
>>> +   /**
>>> +    * Get the i-th component of vector \p v.
>>> +    */
>>> +   scalar_type
>>> +   get(const vector_type &v, unsigned i)
>>> +   {
>>> +      assert(i < vector_width);
>>> +      scalar_type x;
>>> +      memcpy(&x, (char *)v.data() + i * sizeof(x), sizeof(x));
>>> +      return x;
>>> +   }
>>> +
>>> +   /**
>>> +    * Add two vectors with saturation.
>>> +    */
>>> +   vector_type
>>> +   adds(const vector_type &v, const vector_type &w)
>>> +   {
>>> +      const vector_type u = {
>>> +         _mm_adds_epi16(v[0], w[0]),
>>> +         _mm_adds_epi16(v[1], w[1])
>>> +      };
>>> +      return u;
>>> +   }
>>> +
>>> +   /**
>>> +    * Subtract two vectors with saturation.
>>> +    */
>>> +   vector_type
>>> +   subs(const vector_type &v, const vector_type &w)
>>> +   {
>>> +      const vector_type u = {
>>> +         _mm_subs_epi16(v[0], w[0]),
>>> +         _mm_subs_epi16(v[1], w[1])
>>> +      };
>>> +      return u;
>>> +   }
>>> +
>>> +   /**
>>> +    * Compute the bitwise conjunction of two vectors.
>>> +    */
>>> +   vector_type
>>> +   mask(const vector_type &v, const vector_type &w)
>>> +   {
>>> +      const vector_type u = {
>>> +         _mm_and_si128(v[0], w[0]),
>>> +         _mm_and_si128(v[1], w[1])
>>> +      };
>>> +      return u;
>>> +   }
>>> +
>>> +   /**
>>> +    * Reduce the components of a vector using saturating addition.
>>> +    */
>>> +   scalar_type
>>> +   sums(const vector_type &v)
>>> +   {
>>> +      const __m128i v8 = _mm_adds_epi16(v[0], v[1]);
>>> +      const __m128i v4 = _mm_adds_epi16(v8, _mm_shuffle_epi32(v8, 
>>> 0x4e));
>>> +      const __m128i v2 = _mm_adds_epi16(v4, _mm_shuffle_epi32(v4, 
>>> 0xb1));
>>> +      const __m128i v1 = _mm_adds_epi16(v2, _mm_shufflelo_epi16(v2, 
>>> 0xb1));
>>> +      return _mm_extract_epi16(v1, 0);
>>> +   }
>>> +}
>>> +
>>> +#else
>>> +
>>> +/**
>>> + * Thin layer around vector intrinsics so they can be easily 
>>> replaced with
>>> + * e.g. the fall-back scalar path, an implementation with different 
>>> vector
>>> + * width or using different SIMD architectures (AVX-512?!).
>>> + *
>>> + * This implementation operates on scalar values and doesn't rely on
>>> + * any vector extensions.  This is mainly intended for debugging and
>>> + * to keep this file building on exotic platforms.
>>> + */
>>> +namespace {
>>> +   /**
>>> +    * SIMD integer vector data type.
>>> +    */
>>> +   typedef int16_t vector_type;
>>> +
>>> +   /**
>>> +    * Scalar data type matching the representation of a single 
>>> component of \p
>>> +    * vector_type.
>>> +    */
>>> +   typedef int16_t scalar_type;
>>> +
>>> +   /**
>>> +    * Maximum integer value representable as a \p scalar_type.
>>> +    */
>>> +   const scalar_type max_scalar = INT16_MAX;
>>> +
>>> +   /**
>>> +    * Number of components of a \p vector_type.
>>> +    */
>>> +   const unsigned vector_width = 1;
>>> +
>>> +   /**
>>> +    * Set the i-th component of vector \p v to \p x.
>>> +    */
>>> +   void
>>> +   set(vector_type &v, unsigned i, scalar_type x)
>>> +   {
>>> +      assert(i < vector_width);
>>> +      v = x;
>>> +   }
>>> +
>>> +   /**
>>> +    * Get the i-th component of vector \p v.
>>> +    */
>>> +   scalar_type
>>> +   get(const vector_type &v, unsigned i)
>>> +   {
>>> +      assert(i < vector_width);
>>> +      return v;
>>> +   }
>>> +
>>> +   /**
>>> +    * Add two vectors with saturation.
>>> +    */
>>> +   vector_type
>>> +   adds(vector_type v, vector_type w)
>>> +   {
>>> +      return std::max(INT16_MIN, std::min(INT16_MAX, int(v) + w));
>>> +   }
>>> +
>>> +   /**
>>> +    * Substract two vectors with saturation.
>>> +    */
>>> +   vector_type
>>> +   subs(vector_type v, vector_type w)
>>> +   {
>>> +      return std::max(INT16_MIN, std::min(INT16_MAX, int(v) - w));
>>> +   }
>>> +
>>> +   /**
>>> +    * Compute the bitwise conjunction of two vectors.
>>> +    */
>>> +   vector_type
>>> +   mask(vector_type v, vector_type w)
>>> +   {
>>> +      return v & w;
>>> +   }
>>> +
>>> +   /**
>>> +    * Reduce the components of a vector using saturating addition.
>>> +    */
>>> +   scalar_type
>>> +   sums(vector_type v)
>>> +   {
>>> +      return v;
>>> +   }
>>> +}
>>> +
>>> +#endif
>>> +
>>> +namespace {
>>> +   /**
>>> +    * Variable-length vector type intended to represent cycle-count 
>>> costs for
>>> +    * arbitrary atom-to-bank assignments.  It's indexed by a pair of 
>>> integers
>>> +    * (i, p), where i is an atom index and p in {0, 1} indicates the 
>>> parity of
>>> +    * the conflict (respectively, whether the cost is incurred 
>>> whenever the
>>> +    * atoms are assigned the same bank b or opposite-parity banks b 
>>> and b^1).
>>> +    * \sa shader_conflict_weight_matrix()
>>> +    */
>>> +   typedef std::vector<vector_type> weight_vector_type;
>>> +
>>> +   /**
>>> +    * Set the (i, p)-th component of weight vector \p v to \p x.
>>> +    */
>>> +   void
>>> +   set(weight_vector_type &v, unsigned i, unsigned p, scalar_type x)
>>> +   {
>>> +      set(v[(2 * i + p) / vector_width], (2 * i + p) % vector_width, 
>>> x);
>>> +   }
>>> +
>>> +   /**
>>> +    * Get the (i, p)-th component of weight vector \p v.
>>> +    */
>>> +   scalar_type
>>> +   get(const weight_vector_type &v, unsigned i, unsigned p)
>>> +   {
>>> +      return get(v[(2 * i + p) / vector_width], (2 * i + p) % 
>>> vector_width);
>>> +   }
>>> +
>>> +   /**
>>> +    * Swap the (i, p)-th and (j, q)-th components of weight vector 
>>> \p v.
>>> +    */
>>> +   void
>>> +   swap(weight_vector_type &v,
>>> +        unsigned i, unsigned p,
>>> +        unsigned j, unsigned q)
>>> +   {
>>> +      const scalar_type tmp = get(v, i, p);
>>> +      set(v, i, p, get(v, j, q));
>>> +      set(v, j, q, tmp);
>>> +   }
>>> +}
>>> +
>>> +namespace {
>>> +   /**
>>> +    * Object that represents the partitioning of an arbitrary 
>>> register space
>>> +    * into indivisible units (referred to as atoms below) that can 
>>> potentially
>>> +    * be rearranged independently from other registers.  The 
>>> partitioning is
>>> +    * inferred from a number of contiguity requirements specified using
>>> +    * require_contiguous().  This allows efficient look-up of the 
>>> atom index a
>>> +    * given register address belongs to, or conversely the range of 
>>> register
>>> +    * addresses that belong to a given atom.
>>> +    */
>>> +   struct partitioning {
>>> +      /**
>>> +       * Create a (for the moment unrestricted) partitioning of a 
>>> register
>>> +       * file of size \p n.  The units are arbitrary.
>>> +       */
>>> +      partitioning(unsigned n) {
>>> +         for (unsigned i = 0; i < n + num_terminator_atoms; i++) {
>>> +            offsets.push_back(i);
>>> +            atoms.push_back(i);
>>> +         }
>>> +      }
>>> +
>>> +      /**
>>> +       * Require register range [reg, reg + n[ to be considered part 
>>> of the
>>> +       * same atom.
>>> +       */
>>> +      void
>>> +      require_contiguous(unsigned reg, unsigned n)
>>> +      {
>>> +         unsigned r = atoms[reg];
>>> +
>>> +         /* Renumber atoms[reg...] = { r... } and their 
>>> offsets[r...] for the
>>> +          * case that the specified contiguity requirement leads to 
>>> the fusion
>>> +          * (yay) of one or more existing atoms.
>>> +          */
>>> +         for (unsigned reg1 = reg + 1; reg1 < atoms.size(); reg1++) {
>>> +            if (offsets[atoms[reg1]] < reg + n) {
>>> +               atoms[reg1] = r;
>>> +            } else {
>>> +               if (offsets[atoms[reg1 - 1]] != offsets[atoms[reg1]])
>>> +                  r++;
>>> +
>>> +               offsets[r] = offsets[atoms[reg1]];
>>> +               atoms[reg1] = r;
>>> +            }
>>> +         }
>>> +
>>> +         /* Clean up the scraps if we ended up with less atoms than 
>>> we started
>>> +          * with.
>>> +          */
>>> +         offsets.erase(offsets.begin() + r + 1, offsets.end());
>>> +      }
>>> +
>>> +      /**
>>> +       * Get the atom index register address \p reg belongs to.
>>> +       */
>>> +      unsigned
>>> +      atom_of_reg(unsigned reg) const
>>> +      {
>>> +         return atoms[reg];
>>> +      }
>>> +
>>> +      /**
>>> +       * Get the base register address that belongs to atom \p r.
>>> +       */
>>> +      unsigned
>>> +      reg_of_atom(unsigned r) const
>>> +      {
>>> +         return offsets[r];
>>> +      }
>>> +
>>> +      /**
>>> +       * Get the size of atom \p r in register address units.
>>> +       */
>>> +      unsigned
>>> +      size_of_atom(unsigned r) const
>>> +      {
>>> +         assert(r < num_atoms());
>>> +         return reg_of_atom(r + 1) - reg_of_atom(r);
>>> +      }
>>> +
>>> +      /**
>>> +       * Get the number of atoms the whole register space is 
>>> partitioned into.
>>> +       */
>>> +      unsigned
>>> +      num_atoms() const
>>> +      {
>>> +         return offsets.size() - num_terminator_atoms;
>>> +      }
>>> +
>>> +   private:
>>> +      /**
>>> +       * Number of trailing atoms inserted for convenience so among 
>>> other
>>> +       * things we don't need to special-case the last element in
>>> +       * size_of_atom().
>>> +       */
>>> +      static const unsigned num_terminator_atoms = 1;
>>> +      std::vector<unsigned> offsets;
>>> +      std::vector<unsigned> atoms;
>>> +   };
>>> +
>>> +   /**
>>> +    * Only GRF sources (whether they have been register-allocated or 
>>> not) can
>>> +    * possibly incur bank conflicts.
>>> +    */
>>> +   bool
>>> +   is_grf(const fs_reg &r)
>>> +   {
>>> +      return r.file == VGRF || r.file == FIXED_GRF;
>>> +   }
>>> +
>>> +   /**
>>> +    * Register offset of \p r in GRF units.  Useful because the 
>>> representation
>>> +    * of GRFs post-register allocation is somewhat inconsistent and 
>>> depends on
>>> +    * whether the register already had a fixed GRF offset prior to 
>>> register
>>> +    * allocation or whether it was part of a VGRF allocation.
>>> +    */
>>> +   unsigned
>>> +   reg_of(const fs_reg &r)
>>> +   {
>>> +      assert(is_grf(r));
>>> +      if (r.file == VGRF)
>>> +         return r.nr + r.offset / REG_SIZE;
>>> +      else
>>> +         return reg_offset(r) / REG_SIZE;
>>> +   }
>>> +
>>> +   /**
>>> +    * Calculate the finest partitioning of the GRF space compatible 
>>> with the
>>> +    * register contiguity requirements derived from all instructions 
>>> part of
>>> +    * the program.
>>> +    */
>>> +   partitioning
>>> +   shader_reg_partitioning(const fs_visitor *v)
>>> +   {
>>> +      partitioning p(BRW_MAX_GRF);
>>> +
>>> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
>>> +         if (is_grf(inst->dst))
>>> +            p.require_contiguous(reg_of(inst->dst), 
>>> regs_written(inst));
>>> +
>>> +         for (int i = 0; i < inst->sources; i++) {
>>> +            if (is_grf(inst->src[i]))
>>> +               p.require_contiguous(reg_of(inst->src[i]), 
>>> regs_read(inst, i));
>>> +         }
>>> +      }
>>> +
>>> +      return p;
>>> +   }
>>> +
>>> +   /**
>>> +    * Return the set of GRF atoms that should be left untouched at 
>>> their
>>> +    * original location to avoid violating hardware or software 
>>> assumptions.
>>> +    */
>>> +   std::vector<bool>
>>> +   shader_reg_constraints(const fs_visitor *v, const partitioning &p)
>>> +   {
>>> +      std::vector<bool> constrained(p.num_atoms());
>>> +
>>> +      /* These are read implicitly by some send-message instructions 
>>> without
>>> +       * any indication at the IR level.  Assume they are unsafe to 
>>> move
>>> +       * around.
>>> +       */
>>> +      for (unsigned reg = 0; reg < 2; reg++)
>>> +         constrained[p.atom_of_reg(reg)] = true;
>>> +
>>> +      /* Assume that anything referenced via fixed GRFs is baked 
>>> into the
>>> +       * hardware's fixed-function logic and may be unsafe to move 
>>> around.
>>> +       * Also take into account the source GRF restrictions of EOT
>>> +       * send-message instructions.
>>> +       */
>>> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
>>> +         if (inst->dst.file == FIXED_GRF)
>>> +            constrained[p.atom_of_reg(reg_of(inst->dst))] = true;
>>> +
>>> +         for (int i = 0; i < inst->sources; i++) {
>>> +            if (inst->src[i].file == FIXED_GRF ||
>>> +                (is_grf(inst->src[i]) && inst->eot))
>>> +               constrained[p.atom_of_reg(reg_of(inst->src[i]))] = true;
>>> +         }
>>> +      }
>>> +
>>> +      return constrained;
>>> +   }
>>> +
>>> +   /**
>>> +    * Return whether the hardware will be able to prevent a bank 
>>> conflict by
>>> +    * optimizing out the read cycle of a source register.  The 
>>> formula was
>>> +    * found experimentally.
>>> +    */
>>> +   bool
>>> +   is_conflict_optimized_out(const gen_device_info *devinfo, const 
>>> fs_inst *inst)
>>> +   {
>>> +      return devinfo->gen >= 9 &&
>>> +         ((is_grf(inst->src[0]) && (reg_of(inst->src[0]) == 
>>> reg_of(inst->src[1]) ||
>>> +                                    reg_of(inst->src[0]) == 
>>> reg_of(inst->src[2]))) ||
>>> +          reg_of(inst->src[1]) == reg_of(inst->src[2]));
>>> +   }
>>> +
>>> +   /**
>>> +    * Return a matrix that allows reasonably efficient computation 
>>> of the
>>> +    * cycle-count cost of bank conflicts incurred throughout the 
>>> whole program
>>> +    * for any given atom-to-bank assignment.
>>> +    *
>>> +    * More precisely, if C_r_s_p is the result of this function, the 
>>> total
>>> +    * cost of all bank conflicts involving any given atom r can be 
>>> readily
>>> +    * recovered as follows:
>>> +    *
>>> +    *  S(B) = Sum_s_p(d_(p^B_r)_(B_s) * C_r_s_p)
>>> +    *
>>> +    * where d_i_j is the Kronecker delta, and B_r indicates the bank
>>> +    * assignment of r.  \sa delta_conflicts() for a vectorized 
>>> implementation
>>> +    * of the expression above.
>>> +    *
>>> +    * FINISHME: Teach this about the Gen10+ bank conflict rules, 
>>> which are
>>> +    *           somewhat more relaxed than on previous generations.  
>>> In the
>>> +    *           meantime optimizing based on Gen9 weights is likely 
>>> to be more
>>> +    *           helpful than not optimizing at all.
>>> +    */
>>> +   std::vector<weight_vector_type>
>>> +   shader_conflict_weight_matrix(const fs_visitor *v, const 
>>> partitioning &p)
>>> +   {
>>> +      std::vector<weight_vector_type> conflicts(p.num_atoms(),
>>> +         weight_vector_type(DIV_ROUND_UP(2 * p.num_atoms(),
>>> +                                               vector_width)));
>>> +      /* Crude approximation of the number of times the current 
>>> basic block
>>> +       * will be executed at run-time.
>>> +       */
>>> +      unsigned block_scale = 1;
>>> +
>>> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
>>> +         if (inst->opcode == BRW_OPCODE_DO) {
>>> +            block_scale *= 10;
>>> +
>>> +         } else if (inst->opcode == BRW_OPCODE_WHILE) {
>>> +            block_scale /= 10;
>>> +
>>> +         } else if (inst->is_3src(v->devinfo) &&
>>> +                    is_grf(inst->src[1]) && is_grf(inst->src[2])) {
>>> +            const unsigned r = p.atom_of_reg(reg_of(inst->src[1]));
>>> +            const unsigned s = p.atom_of_reg(reg_of(inst->src[2]));
>>> +
>>> +            /* Estimate of the cycle-count cost of incurring a bank 
>>> conflict
>>> +             * for this instruction.  This is only true on the 
>>> average, for a
>>> +             * sequence of back-to-back ternary instructions, since 
>>> the EU
>>> +             * front-end only seems to be able to issue a new 
>>> instruction at
>>> +             * an even cycle.  The cost of a bank conflict incurred 
>>> by an
>>> +             * isolated ternary instruction may be higher.
>>> +             */
>>> +            const unsigned exec_size = 
>>> inst->dst.component_size(inst->exec_size);
>>> +            const unsigned cycle_scale = block_scale * 
>>> DIV_ROUND_UP(exec_size,
>>> +                                                                    
>>> REG_SIZE);
>>> +
>>> +            /* Neglect same-atom conflicts (since they're either 
>>> trivial or
>>> +             * impossible to avoid without splitting the atom), and 
>>> conflicts
>>> +             * known to be optimized out by the hardware.
>>> +             */
>>> +            if (r != s && !is_conflict_optimized_out(v->devinfo, 
>>> inst)) {
>>> +               /* Calculate the parity of the sources relative to 
>>> the start of
>>> +                * their respective atoms.  If their parity is the 
>>> same (and
>>> +                * none of the atoms straddle the 2KB mark), the 
>>> instruction
>>> +                * will incur a conflict iff both atoms are assigned 
>>> the same
>>> +                * bank b.  If their parity is opposite, the 
>>> instruction will
>>> +                * incur a conflict iff they are assigned opposite 
>>> banks (b and
>>> +                * b^1).
>>> +                */
>>> +               const bool p_r = 1 & (reg_of(inst->src[1]) - 
>>> p.reg_of_atom(r));
>>> +               const bool p_s = 1 & (reg_of(inst->src[2]) - 
>>> p.reg_of_atom(s));
>>> +               const unsigned p = p_r ^ p_s;
>>> +
>>> +               /* Calculate the updated cost of a hypothetical conflict
>>> +                * between atoms r and s.  Note that the weight 
>>> matrix is
>>> +                * symmetric with respect to indices r and s by 
>>> construction.
>>> +                */
>>> +               const scalar_type w = std::min(unsigned(max_scalar),
>>> +                  get(conflicts[r], s, p) + cycle_scale);
>>> +               set(conflicts[r], s, p, w);
>>> +               set(conflicts[s], r, p, w);
>>> +            }
>>> +         }
>>> +      }
>>> +
>>> +      return conflicts;
>>> +   }
>>> +
>>> +   /**
>>> +    * Return the set of GRF atoms that could potentially lead to bank
>>> +    * conflicts if laid out unfavorably in the GRF space according to
>>> +    * the specified \p conflicts matrix (\sa
>>> +    * shader_conflict_weight_matrix()).
>>> +    */
>>> +   std::vector<bool>
>>> +   have_any_conflicts(const std::vector<weight_vector_type> &conflicts)
>>> +   {
>>> +      std::vector<bool> any_conflicts(conflicts.size());
>>> +
>>> +      for (unsigned r = 0; r < conflicts.size(); r++) {
>>> +         for (unsigned s = 0; s < conflicts[r].size(); s++)
>>> +            any_conflicts[r] = any_conflicts[r] || 
>>> sums(conflicts[r][s]);
>>> +      }
>>> +
>>> +      return any_conflicts;
>>> +   }
>>> +
>>> +   /**
>>> +    * Calculate the difference between two S(B) cost estimates as 
>>> defined
>>> +    * above (\sa shader_conflict_weight_matrix()).  This represents the
>>> +    * (partial) cycle-count benefit from moving an atom r from bank 
>>> p to n.
>>> +    * The respective bank assignments Bp and Bn are encoded as the \p
>>> +    * bank_mask_p and \p bank_mask_n bitmasks for efficient 
>>> computation,
>>> +    * according to the formula:
>>> +    *
>>> +    *  bank_mask(B)_s_p = -d_(p^B_r)_(B_s)
>>> +    *
>>> +    * Notice the similarity with the delta function in the S(B) 
>>> expression
>>> +    * above, and how bank_mask(B) can be precomputed for every possible
>>> +    * selection of r since bank_mask(B) only depends on it via B_r 
>>> that may
>>> +    * only assume one of four different values, so the caller can 
>>> keep every
>>> +    * possible bank_mask(B) vector in memory without much hassle (\sa
>>> +    * bank_characteristics()).
>>> +    */
>>> +   int
>>> +   delta_conflicts(const weight_vector_type &bank_mask_p,
>>> +                   const weight_vector_type &bank_mask_n,
>>> +                   const weight_vector_type &conflicts)
>>> +   {
>>> +      vector_type s_p = {}, s_n = {};
>>> +
>>> +      for (unsigned r = 0; r < conflicts.size(); r++) {
>>> +         s_p = adds(s_p, mask(bank_mask_p[r], conflicts[r]));
>>> +         s_n = adds(s_n, mask(bank_mask_n[r], conflicts[r]));
>>> +      }
>>> +
>>> +      return sums(subs(s_p, s_n));
>>> +   }
>>> +
>>> +   /**
>>> +    * Return an identity permutation of GRF atoms, represented as 
>>> the start GRF
>>> +    * offset each atom is mapped into.
>>> +    */
>>> +   std::vector<unsigned>
>>> +   identity_reg_permutation(const partitioning &p)
>>> +   {
>>> +      std::vector<unsigned> map(p.num_atoms());
>>> +
>>> +      for (unsigned r = 0; r < map.size(); r++)
>>> +         map[r] = p.reg_of_atom(r);
>>> +
>>> +      return map;
>>> +   }
>>> +
>>> +   /**
>>> +    * Return the bank index of GRF address \p reg, numbered 
>>> according to the
>>> +    * table:
>>> +    *        Even Odd
>>> +    *    Lo    0   1
>>> +    *    Hi    2   3
>>> +    */
>>> +   unsigned
>>> +   bank_of(unsigned reg)
>>> +   {
>>> +      return (reg & 0x40) >> 5 | (reg & 1);
>>> +   }
>>> +
>>> +   /**
>>> +    * Return bitmasks suitable for use as bank mask arguments for the
>>> +    * delta_conflicts() computation.  Note that this is just the 
>>> (negative)
>>> +    * characteristic function of each bank, if you regard it as a set
>>> +    * containing all atoms assigned to it according to the \p map 
>>> array.
>>> +    */
>>> +   std::array<weight_vector_type, 4>
>>> +   bank_characteristics(const std::vector<unsigned> &map)
>>> +   {
>>> +      std::array<weight_vector_type, 4> banks;
>>> +
>>> +      for (unsigned b = 0; b < banks.size(); b++) {
>>> +         banks[b].resize(DIV_ROUND_UP(2 * map.size(), vector_width));
>>> +
>>> +         for (unsigned j = 0; j < map.size(); j++) {
>>> +            for (unsigned p = 0; p < 2; p++)
>>> +               set(banks[b], j, p,
>>> +                   (b ^ p) == bank_of(map[j]) ? -1 : 0);
>>> +         }
>>> +      }
>>> +
>>> +      return banks;
>>> +   }
>>> +
>>> +   /**
>>> +    * Return an improved permutation of GRF atoms based on \p map 
>>> attempting
>>> +    * to reduce the total cycle-count cost of bank conflicts greedily.
>>> +    *
>>> +    * Note that this doesn't attempt to merge multiple atoms into 
>>> one, which
>>> +    * may allow it to do a better job in some cases -- It simply 
>>> reorders
>>> +    * existing atoms in the GRF space without affecting their identity.
>>> +    */
>>> +   std::vector<unsigned>
>>> +   optimize_reg_permutation(const partitioning &p,
>>> +                            const std::vector<bool> &constrained,
>>> +                            const std::vector<weight_vector_type> 
>>> &conflicts,
>>> +                            std::vector<unsigned> map)
>>> +   {
>>> +      const std::vector<bool> any_conflicts = 
>>> have_any_conflicts(conflicts);
>>> +      std::array<weight_vector_type, 4> banks = 
>>> bank_characteristics(map);
>>> +
>>> +      for (unsigned r = 0; r < map.size(); r++) {
>>> +         const unsigned bank_r = bank_of(map[r]);
>>> +
>>> +         if (!constrained[r]) {
>>> +            unsigned best_s = r;
>>> +            int best_benefit = 0;
>>> +
>>> +            for (unsigned s = 0; s < map.size(); s++) {
>>> +               const unsigned bank_s = bank_of(map[s]);
>>> +
>>> +               if (bank_r != bank_s && !constrained[s] &&
>>> +                   p.size_of_atom(r) == p.size_of_atom(s) &&
>>> +                   (any_conflicts[r] || any_conflicts[s])) {
>>> +                  const int benefit =
>>> +                     delta_conflicts(banks[bank_r], banks[bank_s], 
>>> conflicts[r]) +
>>> +                     delta_conflicts(banks[bank_s], banks[bank_r], 
>>> conflicts[s]);
>>> +
>>> +                  if (benefit > best_benefit) {
>>> +                     best_s = s;
>>> +                     best_benefit = benefit;
>>> +                  }
>>> +               }
>>> +            }
>>> +
>>> +            if (best_s != r) {
>>> +               for (unsigned b = 0; b < banks.size(); b++) {
>>> +                  for (unsigned p = 0; p < 2; p++)
>>> +                     swap(banks[b], r, p, best_s, p);
>>> +               }
>>> +
>>> +               std::swap(map[r], map[best_s]);
>>> +            }
>>> +         }
>>> +      }
>>> +
>>> +      return map;
>>> +   }
>>> +
>>> +   /**
>>> +    * Apply the GRF atom permutation given by \p map to register \p 
>>> r and
>>> +    * return the result.
>>> +    */
>>> +   fs_reg
>>> +   transform(const partitioning &p, const std::vector<unsigned> &map,
>>> +             fs_reg r)
>>> +   {
>>> +      if (r.file == VGRF) {
>>> +         const unsigned reg = reg_of(r);
>>> +         const unsigned s = p.atom_of_reg(reg);
>>> +         r.nr = map[s] + reg - p.reg_of_atom(s);
>>> +         r.offset = r.offset % REG_SIZE;
>>> +      }
>>> +
>>> +      return r;
>>> +   }
>>> +}
>>> +
>>> +bool
>>> +fs_visitor::opt_bank_conflicts()
>>> +{
>>> +   assert(grf_used || !"Must be called after register allocation");
>>> +
>>> +   /* No ternary instructions -- No bank conflicts. */
>>> +   if (devinfo->gen < 6)
>>> +      return false;
>>> +
>>> +   const partitioning p = shader_reg_partitioning(this);
>>> +   const std::vector<bool> constrained = 
>>> shader_reg_constraints(this, p);
>>> +   const std::vector<weight_vector_type> conflicts =
>>> +      shader_conflict_weight_matrix(this, p);
>>> +   const std::vector<unsigned> map =
>>> +      optimize_reg_permutation(p, constrained, conflicts,
>>> +                               identity_reg_permutation(p));
>>> +
>>> +   foreach_block_and_inst(block, fs_inst, inst, cfg) {
>>> +      inst->dst = transform(p, map, inst->dst);
>>> +
>>> +      for (int i = 0; i < inst->sources; i++)
>>> +         inst->src[i] = transform(p, map, inst->src[i]);
>>> +   }
>>> +
>>> +   return true;
>>> +}
>>> -- 
>>> 2.10.2
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev