[Mesa-dev] [PATCH 33/95] i965/vec4: implement d2b

Thu Aug 18 06:26:31 UTC 2016

On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > 
> > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral at igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 1525a3d..4014020 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > *instr)
> > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > BRW_CONDITIONAL_NZ));
> > > >        break;
> > > >  
> > > > +   case nir_op_d2b: {
> > > > +      /* two-argument instructions can't take 64-bit
> > > > immediates */
> > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > +
> > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > BRW_CONDITIONAL_NZ));
> > > > +
> > > > +      /* Convert the double CMP result to a single boolean
> > > > result.
> > > > For that
> > > > +       * we take the low 32-bit chunk of each DF component in
> > > > the
> > > > result.
> > > > +       * and do a final MOV to honor the original writemask
> > > > +       */
> > > > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > > > +      emit(MOV(dst, src_reg(result)));
> > > Couldn't you just do a single CMP instruction of the double-
> > > precision
> > > argument and a single-precision 0.0 immediate? 
> > It never occurred to me that we could do that but it seems to work
> > just
> > fine.
> > 
> > > 
> > >  I think you could
> > > potentially also use a 32bit destination type on the CMP
> > > instruction
> > > so
> > > you don't need to emit the PICK_LOW+MOV instructions afterwards.
> > This does not seem to work though.
> > 
> > > 
> > >   It may
> > > hit an instruction decompression bug but it could be cleaned up
> > > by
> > > the
> > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > uncompressed instructions instead of the two uncompressed plus
> > > two
> > > compressed instructions above.
> > I tried splitting the instruction just in case. Notice that doing
> > this
> > requires that we improve the splitting pass to support splitting
> > instructions that write only one register (the original series did
> > not
> > implement support for that because so far the only cared to split
> > DF
> > instructions that wrote more than one register). I implemented a
> > couple
> > of quick hacks to get this done so I could test this but the result
> > is
> > still incorrect.
> > 
> > For reference, if we don't split, the resulting CMP looks like this
> > (after full scalarization):
> > 
> > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > 
> > And if we split the instruction in two we produce this:
> > 
> > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
> > 
> > Both sequences of instructions look correct to me as far as the
> > splitting and DF regioning go, so I guess the 32-bit CMP result is
> > not
> > doing what we want.
> > 
> > I am not too surprised about this. Even if we use a 32b dst type I
> > suppose the instruction execution type is still 64b so I guess
> > there
> > should be a 64b to 32b conversion involved in writing to a 32b
> > result.
> > Seeing how 64b to 32b conversions require that we emit specific
> > code
> > using ALIGN1 mode to deal with alignment requirements I guess it is
> > not
> > too surprising that this does not work, right?
> > 
> Yeah, that's not too surprising, don't worry about it.  Another idea
> would be to do something along the lines of 'MOV.nz null, <df source
> goes here>' to check whether the source is non-zero (so you avoid
> loading the immediate which is a non-trivial operation on Gen7), and
> then use a single-precision SEL instruction (or two predicated MOV
> instructions) to write true or false to the destination (so you don't
> need the PICK_LOW+MOV sequence).

Yeah, that sounds like a good idea, I'll give it a try. Thanks!

Iago

> > 
> > I can send you a trace if you want to have a look at it.
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > +      break;
> > > > +   }
> > > > +
> > > >     case nir_op_i2b:
> > > >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> > > >        break;