[Mesa-dev] [PATCH 33/95] i965/vec4: implement d2b

Wed Aug 17 09:22:50 UTC 2016

On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral at igalia.com> writes:
> 
> > 
> > ---
> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > index 1525a3d..4014020 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > *instr)
> >        emit(CMP(dst, op[0], brw_imm_f(0.0f), BRW_CONDITIONAL_NZ));
> >        break;
> >  
> > +   case nir_op_d2b: {
> > +      /* two-argument instructions can't take 64-bit immediates */
> > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > +      emit(MOV(zero, brw_imm_df(0.0)));
> > +
> > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > +      emit(CMP(tmp, op[0], src_reg(zero), BRW_CONDITIONAL_NZ));
> > +
> > +      /* Convert the double CMP result to a single boolean result.
> > For that
> > +       * we take the low 32-bit chunk of each DF component in the
> > result.
> > +       * and do a final MOV to honor the original writemask
> > +       */
> > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > +      emit(MOV(dst, src_reg(result)));
> Couldn't you just do a single CMP instruction of the double-precision
> argument and a single-precision 0.0 immediate? 

It never occurred to me that we could do that but it seems to work just
fine.

>  I think you could
> potentially also use a 32bit destination type on the CMP instruction
> so
> you don't need to emit the PICK_LOW+MOV instructions afterwards.

This does not seem to work though.

>   It may
> hit an instruction decompression bug but it could be cleaned up by
> the
> SIMD lowering pass afterwards, AFAICT the result would be two
> uncompressed instructions instead of the two uncompressed plus two
> compressed instructions above.

I tried splitting the instruction just in case. Notice that doing this
requires that we improve the splitting pass to support splitting
instructions that write only one register (the original series did not
implement support for that because so far the only cared to split DF
instructions that wrote more than one register). I implemented a couple
of quick hacks to get this done so I could test this but the result is
still incorrect.

For reference, if we don't split, the resulting CMP looks like this
(after full scalarization):

cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };

And if we split the instruction in two we produce this:

cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };

Both sequences of instructions look correct to me as far as the
splitting and DF regioning go, so I guess the 32-bit CMP result is not
doing what we want.

I am not too surprised about this. Even if we use a 32b dst type I
suppose the instruction execution type is still 64b so I guess there
should be a 64b to 32b conversion involved in writing to a 32b result.
Seeing how 64b to 32b conversions require that we emit specific code
using ALIGN1 mode to deal with alignment requirements I guess it is not
too surprising that this does not work, right?

I can send you a trace if you want to have a look at it.

Iago

> > 
> > +      break;
> > +   }
> > +
> >     case nir_op_i2b:
> >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> >        break;