[Mesa-dev] [PATCH] glsl: Make copy propagation not panic when it sees an intrinsic.
Timothy Arceri
timothy.arceri at collabora.com
Mon Dec 12 00:35:51 UTC 2016
On Sun, 2016-12-11 at 00:00 -0800, Kenneth Graunke wrote:
> On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote:
> >
> > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth at whitecape.
> > org> wrote:
> > >
> > > A number of games have large arrays of constants, which we
> > > promote to
> > > uniforms. This introduces copies from the uniform array to the
> > > original
> > > temporary array. Normally, copy propagation eliminates those
> > > copies,
> > > making everything refer to the uniform array directly.
> > >
> > > A number of shaders in "Deus Ex: Mankind Divided" recently
> > > exposed a
> > > limitation of copy propagation - if we had any intrinsics (i.e.
> > > image
> > > access in a compute shader), we weren't able to get rid of these
> > > copies.
> > >
> > > That meant that any variable indexing remained on the temporary
> > > array
> > > rather being moved to the uniform array. i965's scalar backend
> > > currently doesn't support indirect addressing of temporary
> > > arrays,
> > > which meant lowering it to if-ladders. This was horrible.
> > >
> > > On Skylake:
> > >
> > > total instructions in shared programs: 13700090 -> 13654519 (-
> > > 0.33%)
> > > instructions in affected programs: 56438 -> 10867 (-80.75%)
> >
> > Wow!
> >
> > >
> > > helped: 14
> > > HURT: 0
> > >
> > > total cycles in shared programs: 288879704 -> 291270232 (0.83%)
> > > cycles in affected programs: 12758080 -> 15148608 (18.74%)
> >
> > ... that seems nuts?
> >
> > Any idea what's going on with the cycle counts?
>
> Good point...I glossed over the cycle counts when I saw the -80%
> reduction in instructions with 0 shaders hurt. But they do look
> pretty bad, so let's take a closer look...
>
> There are two nearly identical shaders that are the worst offenders:
>
> shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS
> SIMD16:
>
> instructions: 2770 -> 253 (-2,517 instructions or -90.87%)
> spills: 25 -> 0
> fills: 29 -> 0
> cycles: 923266 -> 1420534 (+497,268 cycles or +53.86%)
> compile time: 2.73 seconds -> 0.17 seconds
>
> There are three loops in the program, each of which contains two
> indirect reads of the uvec4[98] constant array.
>
> Before this patch, there were:
> - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program
> - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop
> - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop
> - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop
>
> After this patch, there are:
> - 0 loads at the top of the program
> - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop
> - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop
> - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop
>
> The array indexes in the expensive loop are a[foo] and a[foo + 1].
> foo is modified in the loop, so they can't be hoisted out. I don't
> think we can determine the number of loop iterations.
The first loop has a trip count of 49. If I force it to unroll I get:
CS SIMD16 shader: 682 inst, 2 loops, 1009856 cycles.
The current unrolling rules are:
static bool
is_loop_small_enough_to_unroll(nir_shader *shader, nir_loop_info *li)
{
unsigned max_iter = shader->options->max_unroll_iterations;
if (li->trip_count > max_iter)
return false;
if (li->force_unroll) // this is for consecutive indirect array
access
return true;
bool loop_not_too_large =
li->num_instructions * li->trip_count <= max_iter * 25;
return loop_not_too_large;
}
Maybe we should just drop the first trip count check? Checking the
number of intructions * trip count makes a lot more sense. Dropping the
check allows the first loop to unroll.
>
> The two expensive loops look to be twice as expensive after this
> patch.
> The numbers aren't quite adding up for me - it looks like we should
> spend 200 more cycles per loop iteration, but the loops are like
> 40,000
> -> 90,000 cycles.
>
> I'm not sure what to do with this information. Eliminating 90% of
> the
> instructions seems good. Requiring no scratch access seems good.
> Eliminating the 67 memory loads outside of the loops seems good.
> Doing two memory loads per loop doesn't seem too crazy, given that
> it matches the GLSL source code. Burning 49 registers to store the
> entire array for the lifetime of the program seems pretty crazy...
>
> --Ken
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list