[Mesa-dev] [PATCH] glsl: Make copy propagation not panic when it sees an intrinsic.

Sun Dec 11 08:00:48 UTC 2016

On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote:
> On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> > A number of games have large arrays of constants, which we promote to
> > uniforms.  This introduces copies from the uniform array to the original
> > temporary array.  Normally, copy propagation eliminates those copies,
> > making everything refer to the uniform array directly.
> >
> > A number of shaders in "Deus Ex: Mankind Divided" recently exposed a
> > limitation of copy propagation - if we had any intrinsics (e.g. image
> > access in a compute shader), we weren't able to get rid of these copies.
> >
> > That meant that any variable indexing remained on the temporary array
> > rather than being moved to the uniform array.  i965's scalar backend
> > currently doesn't support indirect addressing of temporary arrays,
> > which meant lowering it to if-ladders.  This was horrible.
> >
> > On Skylake:
> >
> > total instructions in shared programs: 13700090 -> 13654519 (-0.33%)
> > instructions in affected programs: 56438 -> 10867 (-80.75%)
> 
> Wow!
> 
> > helped: 14
> > HURT: 0
> >
> > total cycles in shared programs: 288879704 -> 291270232 (0.83%)
> > cycles in affected programs: 12758080 -> 15148608 (18.74%)
> 
> ... that seems nuts?
> 
> Any idea what's going on with the cycle counts?

Good point... I glossed over the cycle counts when I saw the 80%
reduction in instructions with 0 shaders hurt.  But they do look
pretty bad, so let's take a closer look...

There are two nearly identical shaders that are the worst offenders:

shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS SIMD16:

    instructions: 2770 -> 253 (-2,517 instructions or -90.87%)
    spills:         25 -> 0
    fills:          29 -> 0
    cycles:     923266 -> 1420534 (+497,268 cycles or +53.86%)
    compile time: 2.73 seconds -> 0.17 seconds
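
As a sanity check on the figures above, here is a minimal sketch of how
the relative changes can be recomputed from the raw before/after counts.
The helper name pct_change is made up for illustration; it is not part
of shader-db or Mesa.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Raw counts copied from the 256.shader_test CS SIMD16 report above.
instructions = pct_change(2770, 253)
cycles = pct_change(923266, 1420534)

print(f"instructions: {instructions:+.2f}%")
print(f"cycles:       {cycles:+.2f}%")
```

The instruction count and the cycle estimate move in opposite
directions, which is the discrepancy examined below.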

There are three loops in the program, each of which contains two
indirect reads of the uvec4[98] constant array.

Before this patch, there were:
 - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program
 - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop
 - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop
 - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop

After this patch, there are:
 - 0 loads at the top of the program
 - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop
 - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop
 - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop

The array indexes in the expensive loop are a[foo] and a[foo + 1].
foo is modified in the loop, so they can't be hoisted out.  I don't
think we can determine the number of loop iterations.
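
For context, a rough Python model of the two ways a read like a[foo] can
be handled: the if-ladder mentioned in the quoted commit message, and a
direct indirect read.  This is only a sketch, not Mesa code.  All
function names are hypothetical, the ladder is modelled as a simple
linear chain, and the way foo is updated in the loop is an assumption
made purely so the example runs.

```python
def read_if_ladder(a, index):
    # Models the lowered form: one compare-and-select per array element,
    # so the cost grows with the array size (98 elements here).
    result = 0
    for i in range(len(a)):
        if index == i:
            result = a[i]
    return result

def read_indirect(a, index):
    # Models a single indirect load straight from the array.
    return a[index]

def run_loop(read, a, foo, steps):
    # Two reads per iteration, a[foo] and a[foo + 1].  Because foo changes
    # on every iteration, neither read can be hoisted out of the loop.
    total = 0
    for _ in range(steps):
        total += read(a, foo) + read(a, foo + 1)
        foo = (foo + 2) % (len(a) - 1)  # hypothetical update of foo
    return total

a = list(range(100, 198))  # stand-in for the 98-element constant array
same = run_loop(read_if_ladder, a, 3, 10) == run_loop(read_indirect, a, 3, 10)
print(same)
```

Both variants return the same values; the difference is purely in how
much work each read costs.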

The two expensive loops look to be twice as expensive after this patch.
The numbers aren't quite adding up for me - it looks like we should
spend 200 more cycles per loop iteration, but the loops go from roughly
40,000 to 90,000 cycles.

I'm not sure what to do with this information.  Eliminating 90% of the
instructions seems good.  Requiring no scratch access seems good.
Eliminating the 67 memory loads outside of the loops seems good.
Doing two memory loads per loop doesn't seem too crazy, given that
it matches the GLSL source code.  Burning 49 registers to store the
entire array for the lifetime of the program seems pretty crazy...
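
The 49-register figure is consistent with simple arithmetic on the array
size.  A minimal sketch, assuming 32-bit components, a 32-byte (256-bit)
register, and tight packing with no padding; those three are my
assumptions for the illustration, not something stated above.

```python
ARRAY_LENGTH = 98         # uvec4[98], from the shader
COMPONENTS_PER_VEC4 = 4
BYTES_PER_COMPONENT = 4   # assumption: 32-bit unsigned components
REGISTER_BYTES = 32       # assumption: 256-bit register

array_bytes = ARRAY_LENGTH * COMPONENTS_PER_VEC4 * BYTES_PER_COMPONENT
# Round up to whole registers.
registers = -(-array_bytes // REGISTER_BYTES)

print(array_bytes, registers)
```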

--Ken