[Mesa-dev] Compiling of shader gets stuck in infinite loop

Fri Sep 12 10:14:17 PDT 2014

This extremely slow compilation is not actually an infinite loop.
But the compile time doubles with every unrolled loop step in the
shader, so the time to complete is proportional to 2^N, where N is the
number of loop iterations.

The call to
 (*rvalue)->accept(this);
in ir_constant_folding_visitor::handle_rvalue is key to this.
Dropping that call when the rvalue is not a constant makes compilation
finish very quickly.  For at least this shader it produces exactly the
same results.  Constant folding is still done very effectively for the y
and z channels.
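
To illustrate why one extra accept() per operand can turn a linear walk
into a 2^N one, here is a toy Python model. It is not Mesa code: Expr,
build_chain and count_visits are made-up names, and the model only assumes
that the handler walks an operand a second time after the normal traversal
has already descended into it.

```python
class Expr:
    """Toy expression node with a single operand (None for a leaf)."""

    def __init__(self, operand=None):
        self.operand = operand


def build_chain(depth):
    """Build a chain of `depth` nested expressions around one leaf."""
    node = Expr()
    for _ in range(depth):
        node = Expr(node)
    return node


def count_visits(node, revisit_in_handler):
    """Count how many node visits one traversal starting at `node` makes."""
    visits = 1
    if node.operand is not None:
        # The normal traversal descends into the operand once.
        visits += count_visits(node.operand, revisit_in_handler)
        if revisit_in_handler:
            # Assumed model of the extra accept(): the handler walks
            # the same operand subtree a second time.
            visits += count_visits(node.operand, revisit_in_handler)
    return visits


if __name__ == "__main__":
    for depth in (5, 10, 15, 20):
        chain = build_chain(depth)
        print(depth, count_visits(chain, False), count_visits(chain, True))
```

In this model a chain of depth 10 costs 11 visits without the revisit and
2047 with it, so each extra nesting level roughly doubles the work.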

But the x channel still produces a series of adds of constants instead of
one add with the sum.
That is a separate issue that could still be investigated.
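
As a rough sketch of what "one add with the sum" would mean, the Python
below collapses a left-nested chain of constant adds into a single add.
The tuple encoding, the name "col_x" and both helper functions are
assumptions made for illustration; this is not how any Mesa pass is
written, and reassociating float adds like this can change rounding.

```python
def build_add_chain(var_name, constant, count):
    """Build ((var + c) + c) + ... with `count` adds, as nested tuples."""
    expr = var_name
    for _ in range(count):
        expr = ("add", expr, constant)
    return expr


def fold_add_chain(expr):
    """Collapse a left-nested chain of constant adds into a single add.

    Stops at the first node that is not an add with a float constant on
    the right-hand side and keeps that node as the non-constant operand.
    """
    total = 0.0
    found_constant = False
    while isinstance(expr, tuple) and expr[0] == "add" and isinstance(expr[2], float):
        total += expr[2]
        found_constant = True
        expr = expr[1]
    return ("add", expr, total) if found_constant else expr


if __name__ == "__main__":
    chain = build_add_chain("col_x", 0.1, 30)
    print(fold_add_chain(chain))
```

For 30 adds of 0.1 this yields one add of approximately 3.0 to the
variable instead of a tree 30 levels deep.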

On Thu, Sep 11, 2014 at 1:53 PM, Mike Stroyan <mike at lunarg.com> wrote:

> I have looked at this problem quite a bit but never got to the bottom of
> it.
> This problem really started to show with commit 857f3a6 - "glsl: Ignore
> loop-too-large heuristic if there's bad variable indexing."
> That commit makes many more loops unroll.
> Here is another example piglit shader_runner test that shows the problem.
> Changing the value of LOOP_COUNT and running this with
> "time shader_runner -auto" shows that the compile time doubles each time
> the loop count is incremented by one.
> Large values may seem to take forever.  But they do eventually finish.
> Loop counts over 32 will still prevent unrolling and avoid the slow
> compile.
>
> A key part of the problem is the assignment to "col.rgb" in your shader or
> "tmpvar_3.xyz" in this shader.
> The operation on only some channels results in splitting the vec4 into one
> temporary per channel.
> This comment from src/mesa/drivers/dri/i965/brw_fs_vector_splitting.cpp is
> telling.
>  * If a vector is only ever referenced by its components, then
>  * split those components out to individual variables so they can be
>  * handled normally by other optimization passes.
>
> brw_do_vector_splitting creates the flattening_tmp_y and flattening_tmp_z
> temporaries.
> Operations on one of the channels are optimized quickly.
> But the other two channels are handled badly.
> The operations on the first channel prevent the same simplification of the
> expressions for the other two channels.
>
> Changing ir_vector_splitting_visitor::visit_leave to use
> "writemask = 1 << i;" instead of "writemask = 1;" in the "if (lhs)" case
> makes the y and z channels get handled like the x channel.
> That results in something like
>       (assign  (y) (var_ref flattening_tmp_y)  (expression float * (swiz y
> (var_ref texture2D_retval) )(var_ref channel_expressions@8114) ) )
> It is very fast to compile, but produces bad code that hangs the GPU.
> It is putting the y channel float value into a non-existent "y" channel of
> a simple float temporary, then later reading the real x channel.
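
A small Python sketch of why "1 << i" is wrong for a scalar temporary,
assuming a write mask is a bitfield with bit 0 for x, bit 1 for y and so
on. The helper names are hypothetical and nothing here is taken from the
Mesa sources.

```python
CHANNELS = "xyzw"


def mask_is_valid(write_mask, num_components):
    """A mask is usable only if every set bit names an existing channel."""
    return 0 < write_mask < (1 << num_components)


def describe_mask(write_mask):
    """Return the channel letters selected by a write mask."""
    return "".join(c for i, c in enumerate(CHANNELS) if write_mask & (1 << i))


if __name__ == "__main__":
    # Each split-out temporary is a plain float, i.e. one component.
    for i in range(3):
        for label, mask in (("1", 1), ("1 << i", 1 << i)):
            print(i, label, describe_mask(mask), mask_is_valid(mask, 1))
```

With a one-component destination only mask 1 (x) passes the check; for
i = 1 the shifted mask selects y, which the float temporary does not have.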
>
> [require]
> GLSL >= 1.10
>
> [vertex shader]
> #version 120
> attribute vec2 Tex0;
> attribute vec3 Position;
> void main ()
> {
>   vec4 inPos_1;
>   inPos_1.xy = Position.xy;
>   inPos_1.z = 1.00000;
>   inPos_1.w = 1.00000;
>   gl_Position = inPos_1;
>   vec4 tmpvar_2;
>   tmpvar_2.zw = vec2(0.00000, 0.00000);
>   tmpvar_2.xy = Tex0;
>   gl_TexCoord[0] = tmpvar_2;
> }
>
> [fragment shader]
> #version 120
> #define LOOP_COUNT 25
> uniform sampler2D u_sampler;
> void main ()
> {
>   vec2 tmpvar_1;
>   tmpvar_1 = gl_TexCoord[0].xy;
>   vec4 tmpvar_3;
>   tmpvar_3 = vec4(0.00000, 0.00000, 0.00000, 1.00000);
>   float weighting_5[LOOP_COUNT];
>   for (int i = 0; i < LOOP_COUNT; i++) {
>     float tmpvar_10;
>     tmpvar_10 = ((float(int(abs ((float(i) - 15.0))))) / 15.0000);
>     float tmpvar_11;
>     tmpvar_11 = exp ((-(tmpvar_10) * tmpvar_10));
>     weighting_5[i] = tmpvar_11;
>   };
>   for (int k = 0; k < LOOP_COUNT; k++) {
>     tmpvar_3.xyz += (texture2D (u_sampler, tmpvar_1).xyz * weighting_5[k]);
>   };
>   gl_FragData[0] = tmpvar_3;
> }
>
> [test]
> draw rect -1 -1 2 2
> probe rgb 1 1 0.0 0.0 0.0
>
>
> On Thu, Sep 11, 2014 at 2:02 AM, Iago Toral Quiroga <itoral at igalia.com>
> wrote:
>
>> Hi,
>>
>> I have been looking into this bug:
>>
>> Compiling of shader gets stuck in infinite loop
>> https://bugs.freedesktop.org/show_bug.cgi?id=78468
>>
>> Although this occurs at link time when the Intel driver has run some of
>> its specific lowering passes, it looks like the problem could hit other
>> drivers if the right conditions are met, as the actual problem happens
>> inside common optimization passes.
>>
>> I reproduced the problem with a very simple shader like this:
>>
>> uniform sampler2D tex;
>> out vec4 FragColor;
>> void main()
>> {
>>    vec4 col = texture(tex, vec2(0, 0));
>>    for (int i=0; i<30; i++)
>>       col += vec4(0.1, 0.1, 0.1, 0.1);
>>    col = vec4(col.rgb / 2.0, col.a);
>>    FragColor = col;
>> }
>>
>> and for this shader, I traced the problem down to the fact that
>> do_tree_grafting() is generating instructions like this:
>>
>> (assign  (x) (var_ref flattening_tmp_y@116)  (expression float * (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (swiz x (expression float + (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (swiz x (expression float + (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (swiz x (expression float + (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (swiz x (expression float + (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (swiz x (expression float + (swiz x
>> (expression float + (swiz x (expression float + (swiz x (expression
>> float + (swiz x (expression float + (var_ref col_y) (constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.100000)) ) )(constant float
>> (0.100000)) ) )(constant float (0.500000)) ) )
>>
>> And when we feed these to do_constant_folding() it takes forever to
>> finish. For this shader in particular, removing the tree grafting pass
>> from do_common_optimization eliminates the problem.
>>
>> Notice that small, seemingly irrelevant changes to the shader code can
>> make it so that this never happens. For example, if we initialize 'col'
>> to something like vec4(0,0,0,0) instead of using the texture function,
>> or we remove the division by 2.0 in the last assignment to 'col', these
>> instructions are never produced and the shader compiles okay.
>>
>> The number of iterations in the loop is also important. If we have too
>> many, we do not unroll the loop and the problem never happens. If we have
>> too few, rather than generating a super large tree of expressions like
>> above, we generate something like this and the problem, again, does not
>> happen (notice how it adds 0.1 nine times to make 0.9 rather than
>> chaining 9 add expressions for 10 iterations of the loop):
>>
>> (assign  (x) (var_ref flattening_tmp_y)  (expression float * (expression
>> float + (constant float (0.900000)) (var_ref col_y) ) (constant float
>> (0.500000)) ) )
>>
>> So it seems that whether we generate a huge chunk of expressions or not
>> is subject to a number of factors, but when the right conditions are met
>> we can generate code that can stall compilation forever.
>>
>> Reading what tree grafting is supposed to do, this does not seem to be
>> an unexpected result though, so I wonder what would be the right way to
>> fix this. It would look like we would want to do whatever we are doing
>> when we only have a few iterations in the loop, but I don't know why we
>> generate different code in that case and I am not familiar enough with
>> all the optimization and lowering passes to assess what would make sense
>> to do here... so, any suggestions?
>>
>> Iago
>
>
>
> --
>
>  Mike Stroyan - Software Architect
>  LunarG, Inc.  - The Graphics Experts
>  Cell:  (970) 219-7905
>  Email: Mike at LunarG.com
>  Website: http://www.lunarg.com
>
