[Mesa-dev] [PATCH] glsl: Make copy propagation not panic when it sees an intrinsic.

Marek Olšák maraeo at gmail.com
Mon Dec 12 15:35:46 UTC 2016


On Mon, Dec 12, 2016 at 6:28 AM, Timothy Arceri
<timothy.arceri at collabora.com> wrote:
> On Mon, 2016-12-12 at 11:35 +1100, Timothy Arceri wrote:
>> On Sun, 2016-12-11 at 00:00 -0800, Kenneth Graunke wrote:
>> >
>> > On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote:
>> > >
>> > >
>> > > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth at whitecap
>> > > e.
>> > > org> wrote:
>> > > >
>> > > >
>> > > > A number of games have large arrays of constants, which we
>> > > > promote to
>> > > > uniforms.  This introduces copies from the uniform array to the
>> > > > original
>> > > > temporary array.  Normally, copy propagation eliminates those
>> > > > copies,
>> > > > making everything refer to the uniform array directly.
>> > > >
>> > > > A number of shaders in "Deus Ex: Mankind Divided" recently
>> > > > exposed a
>> > > > limitation of copy propagation - if we had any intrinsics (i.e.
>> > > > image
>> > > > access in a compute shader), we weren't able to get rid of
>> > > > these
>> > > > copies.
>> > > >
>> > > > That meant that any variable indexing remained on the temporary
>> > > > array
>> > > > rather being moved to the uniform array.  i965's scalar backend
>> > > > currently doesn't support indirect addressing of temporary
>> > > > arrays,
>> > > > which meant lowering it to if-ladders.  This was horrible.
>> > > >
>> > > > On Skylake:
>> > > >
>> > > > total instructions in shared programs: 13700090 -> 13654519 (-
>> > > > 0.33%)
>> > > > instructions in affected programs: 56438 -> 10867 (-80.75%)
>> > >
>> > > Wow!
>> > >
>> > > >
>> > > >
>> > > > helped: 14
>> > > > HURT: 0
>> > > >
>> > > > total cycles in shared programs: 288879704 -> 291270232 (0.83%)
>> > > > cycles in affected programs: 12758080 -> 15148608 (18.74%)
>> > >
>> > > ... that seems nuts?
>> > >
>> > > Any idea what's going on with the cycle counts?
>> >
>> > Good point...I glossed over the cycle counts when I saw the -80%
>> > reduction in instructions with 0 shaders hurt.  But they do look
>> > pretty bad, so let's take a closer look...
>> >
>> > There are two nearly identical shaders that are the worst
>> > offenders:
>> >
>> > shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS
>> > SIMD16:
>> >
>> >     instructions: 2770 -> 253 (-2,517 instructions or -90.87%)
>> >     spills:         25 -> 0
>> >     fills:          29 -> 0
>> >     cycles:     923266 -> 1420534 (+497,268 cycles or +53.86%)
>> >     compile time: 2.73 seconds -> 0.17 seconds
>> >
>> > There are three loops in the program, each of which contains two
>> > indirect reads of the uvec4[98] constant array.
>> >
>> > Before this patch, there were:
>> >  - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program
>> >  - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop
>> >  - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop
>> >  - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop
>> >
>> > After this patch, there are:
>> >  - 0 loads at the top of the program
>> >  - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop
>> >  - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop
>> >  - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop
>> >
>> > The array indexes in the expensive loop are a[foo] and a[foo + 1].
>> > foo is modified in the loop, so they can't be hoisted out.  I don't
>> > think we can determine the number of loop iterations.
>>
>> The first loop has a trip count of 49. If I force it to unroll I get:
>>
>> CS SIMD16 shader: 682 inst, 2 loops, 1009856 cycles.
>>
>> The current unrolling rules are:
>>
>> static bool
>> is_loop_small_enough_to_unroll(nir_shader *shader, nir_loop_info *li)
>> {
>>    unsigned max_iter = shader->options->max_unroll_iterations;
>>
>>    if (li->trip_count > max_iter)
>>       return false;
>>
>>    if (li->force_unroll) // this is for consecutive indirect array
>> access
>>       return true;
>>
>>    bool loop_not_too_large =
>>       li->num_instructions * li->trip_count <= max_iter * 25;
>>
>>    return loop_not_too_large;
>> }
>>
>>
>> Maybe we should just drop the first trip count check? Checking the
>> number of intructions * trip count makes a lot more sense. Dropping
>> the
>> check allows the first loop to unroll.
>
> I tried to get this working for GLSL IR loop unrolling and fixed a
> couple of the other problems, panicing at intrinsics (partialy fixed
> -well I just ignore them for now), and nequal condition.
>
> But I also ran into futher issues:
>
> 1) Not being able to detect an induction variable because we end up
> with:
>
> (assign  (x) (var_ref compiler_temp at 19)  (expression uint bitcast_f2u
> (swiz y (var_ref r1) )) )
>                 (if (expression bool != (var_ref compiler_temp at 19)
> (constant uint (0)) ) (
>                   break
>                 )

Yeah, the GLSL compiler can't really do anything if it sees a bitcast.
I tried to write a pass that removes bitcasts, but it's a PITA because
you can have a vec4 where xzw are used as float and y as int.

I just gave up and figured that unrolling in LLVM IR would be simpler.

Marek


More information about the mesa-dev mailing list