[Mesa-dev] [PATCH 2/2] i965: run brw_vec4 optimizations in loop

Wed Nov 25 16:16:47 PST 2015

On Wed, Nov 25, 2015 at 5:36 AM, Juan A. Suarez Romero
<jasuarez at igalia.com> wrote:
> When analyzing output for glsl-mat-from-int-ctor-03 piglit test, found
> that the following piece of generated asm code:
>
> mov(8)          g19<1>.xyzF     g6<4,4,1>.xyzzD                 { align16 1Q };
> dp4(8)          g115<1>.wF      g4<4,4,1>F      g2.4<0,4,1>F    { align16 NoDDChk 1Q };
> cmp.nz.f0(8)    null<1>F        g11<4,4,1>.xyzzF g19<4,4,1>.xyzzF { align16 1Q switch };
> cmp.nz.f0(8)    null<1>D        g7<4,4,1>D      0D              { align16 1Q switch };
> (+f0.any4h) mov(8) g21<1>.xUD   0xffffffffUD                    { align16 1Q };

In the piglit test you cited, opt_vector_float() does this to the relevant code:

 cmp.nz.f0.0 null:F, vgrf6.xyzz:F, vgrf14.xyzz:F
-mov vgrf2.0.x:D, 0D
 (+f0.0.any4h) mov vgrf2.0.x:D, -1D
-mov vgrf2.0.yzw:D, 0D
+mov vgrf2.0:F, [0F, 0F, 0F, 0F]
 cmp.nz.f0.0 null:D, vgrf2.xyzw:D, 0D

which allows dead_code_eliminate to do this:

 cmp.nz.f0.0 null:F, vgrf6.xyzz:F, vgrf14.xyzz:F
-(+f0.0.any4h) mov vgrf2.0.x:D, -1D
 mov vgrf2.0:F, [0F, 0F, 0F, 0F]
 cmp.nz.f0.0 null:D, vgrf2.xyzw:D, 0D

But we should have recognized that the first CMP's flag result is no
longer used and deleted it in the same pass. That bug is caused by

   for (unsigned c = 0; c < 4; c++) {
      if (inst->reads_flag(c)) {
         BITSET_SET(flag_live, c);
      }
   }

still executing even when we have marked the current inst for removal.
I had to write the patch to confirm this was the problem, so I'll send
it and Cc you.

Another problem that you may be interested in solving also appears in
this same code. You'll notice we're left with

 mov vgrf2.0:F, [0F, 0F, 0F, 0F]
 cmp.nz.f0.0 null:D, vgrf2.xyzw:D, 0D
  mov vgrf16.0.x:D, 0D
  (+f0.0.any4h) mov vgrf16.0.x:D, -1D

which we should be able to evaluate at compile-time.

Figuring out how to constant-evaluate operations in a better way than
what we do today (see the opt_algebraic pass), perhaps as part of
constant propagation would be a good project. From there, figuring out
how to constant propagate flag register values in order to remove
predication is the next step.

>
> can be further optimized if we remove the former cmp instruction, as the
> output is overwritten by the second one (a clearly dead-code).
>
> The reason why this was not optimized is in vec4_visitor::run()
> (brw_vec4.cpp file). If opt_vector_float() success, then we apply
> several optimizations once, including a dead-code eliminate.
>
> But turns out that running this DCE optimization leds to the code above,
> and as we don't run again the DCE (we don't run them in a loop), the
> code remains as above.
>
> So this commit run all those optimizations in a loop. More strictly,
> instead of running them in a separated loop, it runs them in the
> previous loop, with all the other optimizations.
>
> When testing against the improvement obtained against shader-db, we
> don't get any improvement (nor any loose). But testing also against the
> non-free shaders in shader-db, we get the following improvement:
>
> total instructions in shared programs: 6819828 -> 6819468 (-0.01%)
> instructions in affected programs:     30516 -> 30156 (-1.18%)
> total loops in shared programs:        1971 -> 1971 (0.00%)
> helped:                                154
> HURT:                                  0
> GAINED:                                0
> LOST:                                  0

I had a look at a helped shader from shader-db (borderlands-2/3701)

Its vec4 IR at the time we call opt_vector_float() contains:

   0: mov vgrf2.0.x:D, 1073741824D
   1: mov vgrf3.0.xy:D, 0D
   2: mov vgrf3.0.w:D, 1065353216D
   3: mov vgrf4.0.xy:D, 1065353216D
   4: mov vgrf4.0.w:D, 0D

The .z component of vgrf3 and vgrf4 isn't completely dead, so the
opt_vector_float() pass doesn't handle them because it cannot find
writes to all 4 components. By moving opt_vector_float() into the main
optimization loop, we simply get lucky that the .z channel hasn't yet
been eliminated and so it makes progress.

I strongly suspect the other cases in shader-db have the same problem.

I think in this case a better solution -- at least until we learn
something more -- would be to initialize opt_vector_float()'s
remaining_channels to only the live components of the register. Do you
want to try that?