[Mesa-dev] [PATCH] glsl: Make copy propagation not panic when it sees an intrinsic.

Mon Dec 12 00:35:51 UTC 2016

On Sun, 2016-12-11 at 00:00 -0800, Kenneth Graunke wrote:
> On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote:
> > 
> > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth at whitecape.
> > org> wrote:
> > > 
> > > A number of games have large arrays of constants, which we
> > > promote to
> > > uniforms.  This introduces copies from the uniform array to the
> > > original
> > > temporary array.  Normally, copy propagation eliminates those
> > > copies,
> > > making everything refer to the uniform array directly.
> > > 
> > > A number of shaders in "Deus Ex: Mankind Divided" recently
> > > exposed a
> > > limitation of copy propagation - if we had any intrinsics (i.e.
> > > image
> > > access in a compute shader), we weren't able to get rid of these
> > > copies.
> > > 
> > > That meant that any variable indexing remained on the temporary
> > > array
> > > rather being moved to the uniform array.  i965's scalar backend
> > > currently doesn't support indirect addressing of temporary
> > > arrays,
> > > which meant lowering it to if-ladders.  This was horrible.
> > > 
> > > On Skylake:
> > > 
> > > total instructions in shared programs: 13700090 -> 13654519 (-
> > > 0.33%)
> > > instructions in affected programs: 56438 -> 10867 (-80.75%)
> > 
> > Wow!
> > 
> > > 
> > > helped: 14
> > > HURT: 0
> > > 
> > > total cycles in shared programs: 288879704 -> 291270232 (0.83%)
> > > cycles in affected programs: 12758080 -> 15148608 (18.74%)
> > 
> > ... that seems nuts?
> > 
> > Any idea what's going on with the cycle counts?
> 
> Good point...I glossed over the cycle counts when I saw the -80%
> reduction in instructions with 0 shaders hurt.  But they do look
> pretty bad, so let's take a closer look...
> 
> There are two nearly identical shaders that are the worst offenders:
> 
> shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS
> SIMD16:
> 
>     instructions: 2770 -> 253 (-2,517 instructions or -90.87%)
>     spills:         25 -> 0
>     fills:          29 -> 0
>     cycles:     923266 -> 1420534 (+497,268 cycles or +53.86%)
>     compile time: 2.73 seconds -> 0.17 seconds
> 
> There are three loops in the program, each of which contains two
> indirect reads of the uvec4[98] constant array.
> 
> Before this patch, there were:
>  - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program
>  - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop
>  - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop
>  - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop
> 
> After this patch, there are:
>  - 0 loads at the top of the program
>  - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop
>  - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop
>  - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop
> 
> The array indexes in the expensive loop are a[foo] and a[foo + 1].
> foo is modified in the loop, so they can't be hoisted out.  I don't
> think we can determine the number of loop iterations.

The first loop has a trip count of 49. If I force it to unroll I get:

CS SIMD16 shader: 682 inst, 2 loops, 1009856 cycles.

The current unrolling rules are:

static bool
is_loop_small_enough_to_unroll(nir_shader *shader, nir_loop_info *li)
{
   unsigned max_iter = shader->options->max_unroll_iterations;

   if (li->trip_count > max_iter)
      return false;

   if (li->force_unroll) // this is for consecutive indirect array
access
      return true;

   bool loop_not_too_large =
      li->num_instructions * li->trip_count <= max_iter * 25;

   return loop_not_too_large;
}

Maybe we should just drop the first trip count check? Checking the
number of intructions * trip count makes a lot more sense. Dropping the
check allows the first loop to unroll.

> 
> The two expensive loops look to be twice as expensive after this
> patch.
> The numbers aren't quite adding up for me - it looks like we should
> spend 200 more cycles per loop iteration, but the loops are like
> 40,000
> -> 90,000 cycles.
> 
> I'm not sure what to do with this information.  Eliminating 90% of
> the
> instructions seems good.  Requiring no scratch access seems good.
> Eliminating the 67 memory loads outside of the loops seems good.
> Doing two memory loads per loop doesn't seem too crazy, given that
> it matches the GLSL source code.  Burning 49 registers to store the
> entire array for the lifetime of the program seems pretty crazy...
> 
> --Ken
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev