<div dir="ltr">This extremely slow compilation is not actually an infinite loop, but the compile time does grow with every unrolled loop step in the shader. The time to complete is proportional to 2^N, where N is the number of loop iterations. The call to (*rvalue)->accept(this); in ir_constant_folding_visitor::handle_rvalue is key to this. Dropping that call for the case when rvalue is not a constant makes compilation finish very quickly, and for at least this shader it produces exactly the same results. Constant folding is done very effectively for the y and z channels. The x channel, however, still produces a series of adds of constants instead of one add with the sum. That is a separate issue that could still be investigated. </div><div class="gmail_extra"> <div class="gmail_quote">On Thu, Sep 11, 2014 at 1:53 PM, Mike Stroyan <<a href="mailto:mike@lunarg.com" target="_blank">mike@lunarg.com</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>I have looked at this problem quite a bit but never got to the bottom of it. This problem really started to show with commit 857f3a6 - "glsl: Ignore loop-too-large heuristic if there's bad variable indexing." That commit makes many more loops unroll. Here is another example piglit shader_runner test that shows the problem. Changing the value of LOOP_COUNT and running this with "time shader_runner -auto" shows that the compile time doubles each time the loop count is incremented by one. </div><div>Large values may seem to take forever, but they do eventually finish. </div>Loop counts over 32 will still prevent unrolling and avoid the slow compile. </div>A key part of the problem is the assignment to "col.rgb" in your shader or "tmpvar_3.xyz" in this shader. </div>The operation on only some channels results in splitting the vec4 into one temporary per channel. This comment from src/mesa/drivers/dri/i965/brw_fs_vector_splitting.cpp is telling. 

 * If a vector is only ever referenced by its components, then
 * split those components out to individual variables so they can be
 * handled normally by other optimization passes.

brw_do_vector_splitting creates the flattening_tmp_y and flattening_tmp_z temporaries. </div>Operations on one of the channels are optimized quickly. </div>But the other two channels are handled badly. The operations on the first channel prevent the same simplification of the expressions for the other two channels. Changing ir_vector_splitting_visitor::visit_leave to use "writemask = 1 << i;" instead of "writemask = 1;" in the "if (lhs)" case makes the y and z channels get handled like the x channel. That results in something like

(assign (y) (var_ref flattening_tmp_y) (expression float * (swiz y (var_ref texture2D_retval) )(var_ref channel_expressions@8114) ) )

It is very fast to compile, but produces bad code that hangs the GPU. It puts the y channel float value into a non-existent "y" channel of a simple float temporary, then later reads the real x channel. 
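To make the writemask problem concrete, here is a minimal sketch, assuming a writemask is a bitfield in which bit i selects channel i of the destination (bit 0 = x, bit 1 = y, and so on). The helper name writemask_is_valid is hypothetical and is not part of Mesa; it only shows why "1 << i" cannot be right for a one-component float temporary.

```cpp
// Hypothetical helper, not Mesa code: returns true when every bit set in
// `mask` names a channel that exists in a destination with `components`
// channels (1 for float, 4 for vec4). Assumes components is between 1 and 4.
bool writemask_is_valid(unsigned mask, unsigned components) {
    const unsigned existing = (1u << components) - 1u; // e.g. 0b0001 for float
    return mask != 0u && (mask & ~existing) == 0u;
}
```

Under this model, writemask_is_valid(1u, 1) holds for a scalar temporary, while writemask_is_valid(1u << 1, 1) does not: the mask selects a "y" channel the float does not have, which matches the bad (assign (y) ...) shown above.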
<div><div><div><div><div>
[require]
GLSL >= 1.10

[vertex shader]
#version 120
attribute vec2 Tex0;
attribute vec3 Position;
void main ()
{
  vec4 inPos_1;
  inPos_1.xy = Position.xy;
  inPos_1.z = 1.00000;
  inPos_1.w = 1.00000;
  gl_Position = inPos_1;
  vec4 tmpvar_2;
  tmpvar_2.zw = vec2(0.00000, 0.00000);
  tmpvar_2.xy = Tex0;
  gl_TexCoord[0] = tmpvar_2;
}

[fragment shader]
#version 120
#define LOOP_COUNT 25
uniform sampler2D u_sampler;
void main ()
{
  vec2 tmpvar_1;
  tmpvar_1 = gl_TexCoord[0].xy;
  vec4 tmpvar_3;
  tmpvar_3 = vec4(0.00000, 0.00000, 0.00000, 1.00000);
  float weighting_5[LOOP_COUNT];
  for (int i = 0; i < LOOP_COUNT; i++) {
    float tmpvar_10;
    tmpvar_10 = ((float(int(abs ((float(i) - 15.0))))) / 15.0000);
    float tmpvar_11;
    tmpvar_11 = exp ((-(tmpvar_10) * tmpvar_10));
    weighting_5[i] = tmpvar_11;
  };
  for (int k = 0; k < LOOP_COUNT; k++) {
    tmpvar_3.xyz += (texture2D (u_sampler, tmpvar_1).xyz * weighting_5[k]);
  };
  gl_FragData[0] = tmpvar_3;
}

[test]
draw rect -1 -1 2 2
probe rgb 1 1 0.0 0.0 0.0
</div></div></div></div></div></div><div class="gmail_extra"><div><div class="h5"> <div class="gmail_quote">On Thu, Sep 11, 2014 at 2:02 AM, Iago Toral Quiroga <<a href="mailto:itoral@igalia.com" target="_blank">itoral@igalia.com</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, I have been looking into this bug: Compiling of shader gets stuck in infinite loop <a href="https://bugs.freedesktop.org/show_bug.cgi?id=78468" target="_blank">https://bugs.freedesktop.org/show_bug.cgi?id=78468</a> Although this occurs at link time when the Intel driver has run some of its specific lowering passes, it looks like the problem could hit other drivers if the right conditions are met, as the actual problem happens inside common optimization passes. 
I reproduced the problem with a very simple shader like this: uniform sampler2D tex; out vec4 FragColor; void main() { vec4 col = texture(tex, vec2(0, 0)); for (int i=0; i<30; i++) col += vec4(0.1, 0.1, 0.1, 0.1); col = vec4(col.rgb / 2.0, col.a); FragColor = col; } and for this shader, I traced the problem down to the fact that do_tree_grafting() is generating instructions like this: (assign (x) (var_ref flattening_tmp_y@116) (expression float * (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (swiz x (expression float + (var_ref col_y) (constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) 
)(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.100000)) ) )(constant float (0.500000)) ) )

And when we feed these to do_constant_folding() it takes forever to finish. For this shader in particular, removing the tree grafting pass from do_common_optimization eliminates the problem.

Note that small, seemingly irrelevant changes to the shader code can keep this from happening. For example, if we initialize 'col' to something like vec4(0,0,0,0) instead of using the texture function, or we remove the division by 2.0 in the last assignment to 'col', these instructions are never produced and the shader compiles okay.

The number of iterations in the loop is also important. If we have too many, we do not unroll the loop and the problem never happens. If we have too few, then rather than generating a super large tree of expressions like the one above, we generate something like this and the problem, again, does not happen (notice how it adds 0.1 nine times to make 0.9 rather than chaining 9 add expressions for 10 iterations of the loop):

(assign (x) (var_ref flattening_tmp_y) (expression float * (expression float + (constant float (0.900000)) (var_ref col_y) ) (constant float (0.500000)) ) )

So it seems that whether we generate a huge chunk of expressions or not is subject to a number of factors, but when the right conditions are met we can generate code that can stall compilation forever.

Reading what tree grafting is supposed to do, this does not seem to be an unexpected result though, so I wonder what would be the right way to fix this. It looks like we would want to do whatever we are doing when we only have a few iterations in the loop, but I don't know why we generate different code in that case, and I am not familiar enough with all the optimization and lowering passes to assess what would make sense to do here... so, any suggestions? 
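The doubling reported in the reply at the top of this thread can be pictured with a toy model. This is only an illustrative sketch, not Mesa code: Node, make_chain, CountingVisitor and count_visits are hypothetical names, and it assumes that the extra accept() call in handle_rvalue amounts to a second full walk of an operand that the normal traversal has already walked. Under that assumption the cost obeys T(d) = 2 T(d-1) + 1, which is exponential in the nesting depth d.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Toy expression node: a leaf has no operand, a chain link has exactly one.
struct Node {
    std::unique_ptr<Node> operand;
};

// Builds a chain with `depth` links on top of a single leaf, loosely
// mimicking the deeply nested add expressions quoted above.
std::unique_ptr<Node> make_chain(int depth) {
    std::unique_ptr<Node> node(new Node());
    for (int i = 0; i < depth; ++i) {
        std::unique_ptr<Node> parent(new Node());
        parent->operand = std::move(node);
        node = std::move(parent);
    }
    return node;
}

struct CountingVisitor {
    bool revisit_in_handle;   // true models the extra accept() call
    std::uint64_t visits;

    explicit CountingVisitor(bool revisit)
        : revisit_in_handle(revisit), visits(0) {}

    void visit(const Node &n) {
        ++visits;
        if (n.operand) {
            visit(*n.operand);          // normal hierarchical descent
            handle_rvalue(*n.operand);  // invoked when leaving the node
        }
    }

    void handle_rvalue(const Node &operand) {
        if (revisit_in_handle)
            visit(operand);             // second walk of the same subtree
    }
};

// Number of node visits needed to traverse a chain of the given depth.
std::uint64_t count_visits(int depth, bool revisit) {
    std::unique_ptr<Node> root = make_chain(depth);
    CountingVisitor visitor(revisit);
    visitor.visit(*root);
    return visitor.visits;
}
```

In this model a chain of depth d costs 2^(d+1) - 1 visits with the second walk and d + 1 visits without it, so each extra unrolled iteration roughly doubles the work in the first case and adds a constant in the second.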
Iago _______________________________________________ mesa-dev mailing list <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a> </blockquote></div> </div></div>-- Mike Stroyan - Software Architect LunarG, Inc. - The Graphics Experts Cell: <a href="tel:%28970%29%20219-7905" value="+19702197905" target="_blank">(970) 219-7905</a> Email: Mike@LunarG.com Website: <a href="http://www.lunarg.com" target="_blank">http://www.lunarg.com</a> </div> </blockquote></div> </div>