[Mesa-dev] [PATCH 07/21] i965/fs: Implement SIMD16 64-bit integer multiplies on Gen 8.

Tue Dec 23 08:51:33 PST 2014

On 12/22/14, Ben Widawsky <benjamin.widawsky at intel.com> wrote:
> This patch uses the new QWORD type introduced on Gen8. This allows us to
> perform
> the operation without requiring the additional MACH.
>
> Similar to Gen7, it seems we must demote SIMD16 to 2 SIMD8s. On the bright
> side,
> we get the results in 3 instructions, and no MACH. MACH is undesirable
> because
> it requires the accumulator write flag, which can hinder optimization
> passes.
>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
>  src/mesa/drivers/dri/i965/brw_defines.h        |  2 +
>  src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 +++++
>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp   | 73
> ++++++++++++++------------
>  src/mesa/drivers/dri/i965/brw_shader.cpp       |  2 +
>  4 files changed, 56 insertions(+), 33 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_defines.h
> b/src/mesa/drivers/dri/i965/brw_defines.h
> index 28e398d..102ba4a 100644
> --- a/src/mesa/drivers/dri/i965/brw_defines.h
> +++ b/src/mesa/drivers/dri/i965/brw_defines.h
> @@ -1100,6 +1100,8 @@ enum opcode {
>      *   and number of SO primitives needed.
>      */
>     GS_OPCODE_FF_SYNC_SET_PRIMITIVES,
> +
> +   SHADER_OPCODE_MOV64,
>  };
>
>  enum brw_urb_write_flags {
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> index c652d65..3a15837 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> @@ -1619,6 +1619,18 @@ fs_generator::generate_code(const cfg_t *cfg, int
> dispatch_width)
>        case BRW_OPCODE_MUL:
>  	 brw_MUL(p, dst, src[0], src[1]);
>  	 break;
> +      case SHADER_OPCODE_MOV64:
> +         /* This opcode is used to mov the result of a native SIMD16 (2
> SIMD8s)
> +          * mul into the dst register.
> +          */
> +         assert(brw->gen >= 8);
> +         src[0].subnr = 4;
> +         src[0].type = dst.type;
> +         src[0] = stride(src[0], 8, 4, 2);
> +         assert(dst.type == BRW_REGISTER_TYPE_UD ||
> +                dst.type == BRW_REGISTER_TYPE_D);
> +         brw_MOV(p, dst, src[0]);
> +         break;
>        case BRW_OPCODE_AVG:
>  	 brw_AVG(p, dst, src[0], src[1]);
>  	 break;
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> index de03618..98f1b0d 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> @@ -798,46 +798,53 @@ fs_visitor::visit(ir_expression *ir)
>  	 emit(MUL(this->result, op[0], op[1]));
>        }
>        break;
> -   case ir_binop_imul_high: {
> -      if (brw->gen >= 8)
> -         no16("SIMD16 explicit accumulator operands unsupported\n");
> +   case ir_binop_imul_high:
> +      if (brw->gen >= 8) {
> +         /* Gen8 is able to do the full 32x32 multiply into a QWORD.
> +          * The docs say you cannot use direct addressing for a destination
> of
> +          * more than 2 registers, which is the case in SIMD16. Therefore,
> like
> +          * Gen7, the operation must be downgraded to two SIMD8 muls.
> +          *
> +          * Oddly, the results can be gathered with 1 mov operation, even
> though
> +          * the docs suggest that shouldn't work.
> +          */

Does the simulator not complain about this? I think it definitely
should. The docs are clear that you can't read from >2 consecutive
registers.

I suspect that again the piglit tests are insufficient. I think all of
the piglit tests are calculating the same result in each channel.