[Mesa-dev] [PATCH 00/12] nir: Add some optimizations on variables

Thu Jul 26 15:59:56 UTC 2018

This series adds some optimizations on variables to try and help shaders
with indirects where we can't just throw the variables away and use SSA.
The particular motivation of this series is the tessellation control
shaders in Batman: Arkham City as translated by DXVK.  When DXVK
translates a tessellation shader, it's common to see this pattern:

    layout(location=0) in vec3 v0[3];
    layout(location=0) in vec2 v1[3];
    layout(location=0) out vec4 oVertex[3][32];

    vec4 shader_in[3][32];

    void hs_main () {
        oVertex[gl_InvocationID][0].xyz = shader_in[gl_InvocationID][0].xyz;
        oVertex[gl_InvocationID][1].xy = shader_in[gl_InvocationID][1].xy;
        // Do some other stuff
    }

    void main () {
        shader_in[0][0].xyz = v0[0];
        shader_in[1][0].xyz = v0[1];
        shader_in[2][0].xyz = v0[2];
        shader_in[0][1].xy = v1[0];
        shader_in[1][1].xy = v1[1];
        shader_in[2][1].xy = v1[2];

        hs_main();
    }

Having that shader_in temporary array currently stops NIR's optimization
ability dead.  In anv, we end up generating a shader that first loads all
of the inputs into temporary storage and, because they are indirect, we
generate if-ladders for the reads of shader_in.  This isn't so bad in the
above example, but Batman: Arkham City has tessellation control shaders
with 8 inputs of 9 vertices each.  That many vec4s work out to 4.5 KiB of
data, which is 9x the amount of storage we have per-thread in a SIMD8
shader, so we end up spilling the whole lot.
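
For readers unfamiliar with the term, an if-ladder here means a chain of
comparisons against every possible index, used when the storage being read
cannot be addressed dynamically.  The following is only a plain-C
illustration of that shape, not what anv actually emits; the vec4 struct
and read_indirect_3() are names made up for this sketch.

```c
#include <assert.h>

/* Toy stand-in for a 4-component register value. */
typedef struct { float x, y, z, w; } vec4;

/* Read element idx of a 3-element "array" whose elements live in separate
 * registers (r0, r1, r2) and so cannot be indexed at run time.  The
 * indirect read turns into one comparison per candidate element. */
static vec4
read_indirect_3(vec4 r0, vec4 r1, vec4 r2, unsigned idx)
{
    assert(idx < 3);
    if (idx == 0)
        return r0;
    else if (idx == 1)
        return r1;
    else
        return r2;
}
```

Each indirect read costs one comparison per candidate element, so the
ladder grows with the size of the array being indexed.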

This series attempts to solve this problem (and others like it) by adding
four optimizations:

 1. Structure splitting.  This isn't actually needed for this case since
    there are no structures but it's needed in order for the other passes
    to be more generally applicable.

 2. Array splitting.  This pass looks at something like the shader_in array
    above, determines that the second array index is only ever used
    directly (never indirectly), and splits it into 32 arrays of vec4[3].
    30 of those arrays then get deleted because we never use them.

 3. Vector narrowing.  This pass looks at vectors or arrays of vectors and
    tries to determine if some of the channels are unused.  It then shrinks
    the vector and reworks all the load/store operations to swizzle things
    appropriately for the smaller type.  This way it can delete components
    from the middle of a vector.  In the example above, it takes some of
    the new vec4[3] arrays created by array splitting and shrinks them to
    vec3[3] or vec2[3].

 4. Array copy detection.  This is a peephole optimization that looks for
    a particular array copy pattern and turns it into a copy_deref
    intrinsic which copies the entire array.  This is useful because
    copy_prop_vars can see through copy_deref intrinsics and turn indirect
    loads from the destination of the copy into an indirect load of the
    source.
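
As a rough model of how passes 2 and 3 act on the example, the sketch below
records each access to shader_in as a slot (the second array index, or -1
if it is indirect) plus a component mask, then reports which slots are ever
read and how wide each one needs to be.  It is a standalone toy in C, not
code from nir_split_vars.c; struct access, analyze() and the other names
are hypothetical, and treating a component as used only when it is read is
an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SLOTS 32  /* size of the second dimension of shader_in */

/* One load or store of shader_in[vertex][slot].<components>. */
struct access {
    int slot;            /* constant second index, or -1 if indirect */
    bool is_load;
    unsigned comp_mask;  /* bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w */
};

struct split_info {
    bool can_split;                 /* false once any slot index is indirect */
    unsigned read_mask[NUM_SLOTS];  /* components read from each slot */
};

static struct split_info
analyze(const struct access *acc, int count)
{
    struct split_info info;
    memset(&info, 0, sizeof(info));
    info.can_split = true;

    for (int i = 0; i < count; i++) {
        assert(acc[i].slot < NUM_SLOTS);
        if (acc[i].slot < 0) {
            /* An indirect second index means the array cannot be split. */
            info.can_split = false;
            continue;
        }
        if (acc[i].is_load)
            info.read_mask[acc[i].slot] |= acc[i].comp_mask & 0xf;
    }
    return info;
}

/* Number of split arrays that survive because something reads them. */
static int
live_slots(const struct split_info *info)
{
    int n = 0;
    for (int s = 0; s < NUM_SLOTS; s++)
        n += info->read_mask[s] != 0;
    return n;
}

/* Vector width after narrowing: one channel per component that is read. */
static int
narrowed_width(unsigned mask)
{
    int n = 0;
    for (; mask; mask >>= 1)
        n += mask & 1;
    return n;
}
```

Feeding it the accesses from the example (stores to slots 0 and 1, a read
of .xyz from slot 0 and of .xy from slot 1) gives 2 live slots with widths
3 and 2, in line with the vec3[3] and vec2[3] arrays mentioned in item 3.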

The end result of those four optimizations put together is that the above
example now looks something like this (after function inlining and other
optimizations):

    layout(location=0) in vec3 v0[3];
    layout(location=0) in vec2 v1[3];
    layout(location=0) out vec4 oVertex[3][32];

    vec4 shader_in[3][32];

    void main () {
        oVertex[gl_InvocationID][0].xyz = v0[gl_InvocationID].xyz;
        oVertex[gl_InvocationID][1].xy = v1[gl_InvocationID].xy;
        // Do some other stuff
    }

and we can very nicely handle the indirect per-vertex loads in the back-end
without the need for if-ladders.  The end result is that the tessellation
shaders in Batman: Arkham City no longer spill at all and are actually
readable.
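
The step that lets copy_prop_vars replace reads of shader_in with reads of
v0 and v1 is recognizing that a run of per-element assignments amounts to
one whole-array copy.  A minimal sketch of that check, in standalone C,
might look as follows.  It is not the logic of nir_opt_find_array_copies.c,
whose exact matching rules are not described here; struct elem_copy and
is_whole_array_copy() are hypothetical names, and requiring identical
source and destination indices is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LEN 32

/* One assignment dst[dst_idx] = src[src_idx] with constant indices. */
struct elem_copy {
    unsigned dst_idx;
    unsigned src_idx;
};

/* Returns true if the assignments together copy src[0..len-1] into
 * dst[0..len-1] element for element, so that they could be replaced by a
 * single whole-array copy. */
static bool
is_whole_array_copy(const struct elem_copy *copies, unsigned count,
                    unsigned len)
{
    bool seen[MAX_LEN] = { false };
    unsigned covered = 0;

    assert(len <= MAX_LEN);
    for (unsigned i = 0; i < count; i++) {
        unsigned idx = copies[i].dst_idx;

        /* Reject permuted or out-of-range element copies. */
        if (idx != copies[i].src_idx || idx >= len)
            return false;
        if (!seen[idx]) {
            seen[idx] = true;
            covered++;
        }
    }
    /* Every element must have been copied at least once. */
    return covered == len;
}
```

In the example, the three assignments from v0[0], v0[1] and v0[2] cover
all three elements with matching indices, so they qualify; a copy that
skips or permutes an element would not.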

Another side-effect of this series is that it potentially allows us to
vastly simplify nir_lower_vars_to_ssa.  Most of the complexity in the
vars_to_ssa pass comes with trying to handle structures, arrays, potential
aliasing, etc.  If we run structure and array splitting prior to
vars_to_ssa, we could make it only consider non-array vector or scalar
variables and get exactly the same effect.  Gone would be the pile of data
structures that we build just to determine if a particular array dimension
is indirected.

This series can be found on my gitlab here:

https://gitlab.freedesktop.org/jekstrand/mesa/commits/wip/nir-var-opts

Cc: Timothy Arceri <tarceri at itsqueeze.com>

Jason Ekstrand (12):
  util/list: Make some helpers take const lists
  nir: Take if uses into account in ssa_def_components_read
  nir/print: Remove a bogus assert
  nir/instr_set: Fix nir_instrs_equal for derefs
  nir/types: Add array_or_matrix helpers
  nir: Add a structure splitting pass
  nir: Add an array splitting pass
  intel/nir: Use the new structure and array splitting passes
  nir: Add a array-of-vector variable narrowing pass
  intel/nir: Use narrow_vec_vars
  nir: Add an array copy optimization
  intel/nir: Enable nir_opt_find_array_copies

 src/compiler/Makefile.sources                |    2 +
 src/compiler/nir/meson.build                 |    2 +
 src/compiler/nir/nir.c                       |    3 +
 src/compiler/nir/nir.h                       |    5 +
 src/compiler/nir/nir_instr_set.c             |    4 +-
 src/compiler/nir/nir_opt_find_array_copies.c |  376 ++++++
 src/compiler/nir/nir_print.c                 |    1 -
 src/compiler/nir/nir_split_vars.c            | 1219 ++++++++++++++++++
 src/compiler/nir_types.cpp                   |   15 +
 src/compiler/nir_types.h                     |    2 +
 src/intel/compiler/brw_nir.c                 |    4 +
 src/util/list.h                              |    8 +-
 12 files changed, 1634 insertions(+), 7 deletions(-)
 create mode 100644 src/compiler/nir/nir_opt_find_array_copies.c
 create mode 100644 src/compiler/nir/nir_split_vars.c

-- 
2.17.1