[Mesa-dev] [PATCH 1/2] i965: Use nir_lower_load_const_to_scalar().

Fri Jan 22 15:58:41 PST 2016

On Thu, Jan 21, 2016 at 4:37 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> I don't know why, but we never hooked up this pass Eric wrote.
> Otherwise, you can end up with stupid scalarized code such as:
>
>    vec4 ssa_7 = load_const (0.0, 0.0, 0.0, 0.0)
>    vec4 ssa_8 = ...
>    vec1 ssa_9 = feq ssa_8, ssa_7
>    vec1 ssa_10 = feq ssa_8.y, ssa_7.y
>    vec1 ssa_11 = feq ssa_8, ssa_7.z
>    vec1 ssa_12 = feq ssa_8.y, ssa_7.w
>
> ssa_8.xyxy == <0, 0, 0, 0> should only take two feq instructions.
>
> shader-db on Skylake:
>
> total instructions in shared programs: 9111788 -> 9111384 (-0.00%)
> instructions in affected programs: 32421 -> 32017 (-1.25%)
> helped: 277
> HURT: 69

All the hurt programs seem to have an extra instruction because of
interactions with multiply-add fusing. What we have with this patch
might even be better.
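
As a sanity check on the instruction-count percentages quoted above, here
is a minimal sketch of the arithmetic. The helper name pct_change is made
up for illustration and is not part of shader-db; only the four counts
come from the report.

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Instruction counts copied from the shader-db output quoted above.
total = pct_change(9111788, 9111384)
affected = pct_change(32421, 32017)

# Rounded to two decimals, the same precision the report uses.
print(f"total: {total:.2f}%  affected: {affected:.2f}%")
```

The total changes by 404 instructions out of roughly 9.1 million, which is
why it rounds to -0.00% even though the affected programs shrink by more
than 1%.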

>
> total cycles in shared programs: 69221226 -> 69219394 (-0.00%)
> cycles in affected programs: 917796 -> 915964 (-0.20%)
> helped: 317
> HURT: 408

One weird thing here... ETQW/fp-259.shader_test goes from 54 -> 53
instructions (another multiply-add interaction) in both the SIMD8 and
SIMD16 programs, but the cycle estimate goes from 422 -> 432 in SIMD8
and 392 -> 570 in SIMD16.

There are four texture operations, and they're scheduled together in
SIMD8. In SIMD16, for some reason the one that reads surface 2 is
scheduled basically at the end of the program...

Also, how in the world could the SIMD16 cycle estimate be *lower* than
the SIMD8 cycle estimate (392 vs. 422 before the patch)?

>
> This also prevents regressions when disabling channel expressions.
>
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
> ---
>  src/mesa/drivers/dri/i965/brw_nir.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_nir.c b/src/mesa/drivers/dri/i965/brw_nir.c
> index 935529a..ce9b9db 100644
> --- a/src/mesa/drivers/dri/i965/brw_nir.c
> +++ b/src/mesa/drivers/dri/i965/brw_nir.c
> @@ -482,6 +482,11 @@ brw_preprocess_nir(nir_shader *nir, bool is_scalar)
>
>     nir = nir_optimize(nir, is_scalar);
>
> +   if (is_scalar) {
> +      OPT_V(nir_lower_load_const_to_scalar);
> +      OPT(nir_opt_cse);

Did you find the call to nir_opt_cse to be necessary? With it removed,
the only change I see is the cycle estimate for trine-2/fp-2 going from
696 -> 704, so I'd probably leave it out.
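
For context, here is a toy model of what scalarizing the constant buys on
the example from the commit message. It is a sketch only, not Mesa code:
the Feq tuple, scalarize_constants() and cse() are hypothetical stand-ins,
and the model folds the constant's value straight into the compare instead
of emitting separate scalar load_const instructions the way the real pass
would. It also says nothing about whether the explicit nir_opt_cse call is
needed in practice, only that some CSE run after the lowering can merge
the duplicates.

```python
from typing import NamedTuple


class Feq(NamedTuple):
    """Scalar compare of one ssa_8 component against one source (toy model)."""
    lhs: str     # component of ssa_8, e.g. "x"
    rhs: object  # ("ssa_7", component) reference, or an immediate value


# vec4 ssa_7 = load_const (0.0, 0.0, 0.0, 0.0)
SSA_7 = {"x": 0.0, "y": 0.0, "z": 0.0, "w": 0.0}

# ssa_8.xyxy == ssa_7.xyzw after the ALU ops have been scalarized.
before = [
    Feq("x", ("ssa_7", "x")),
    Feq("y", ("ssa_7", "y")),
    Feq("x", ("ssa_7", "z")),
    Feq("y", ("ssa_7", "w")),
]


def scalarize_constants(instrs):
    """Replace each reference to a component of ssa_7 with its scalar value."""
    return [Feq(i.lhs, SSA_7[i.rhs[1]]) for i in instrs]


def cse(instrs):
    """Drop exact duplicates, keeping the first occurrence of each."""
    return list(dict.fromkeys(instrs))


print(len(cse(before)))                       # compares left without lowering
print(len(cse(scalarize_constants(before))))  # compares left with lowering
```

Without the lowering, each compare reads a different component of ssa_7,
so none of the four look identical to CSE. Once the components are known
to be the same scalar 0.0, the .x and .y compares each appear twice and
collapse to two.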

Reviewed-by: Matt Turner <mattst88 at gmail.com>