[Mesa-dev] [PATCH 33/95] i965/vec4: implement d2b

Iago Toral itoral at igalia.com
Thu Aug 18 08:02:04 UTC 2016


On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > 
> > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral at igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 1525a3d..4014020 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -1497,6 +1497,24 @@ vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > *instr)
> > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > BRW_CONDITIONAL_NZ));
> > > >        break;
> > > >  
> > > > +   case nir_op_d2b: {
> > > > +      /* two-argument instructions can't take 64-bit
> > > > immediates */
> > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > +
> > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > BRW_CONDITIONAL_NZ));
> > > > +
> > > > +      /* Convert the double CMP result to a single boolean
> > > > result.
> > > > For that
> > > > +       * we take the low 32-bit chunk of each DF component in
> > > > the
> > > > result.
> > > > +       * and do a final MOV to honor the original writemask
> > > > +       */
> > > > +      dst_reg result = dst_reg(this, glsl_type::bvec4_type);
> > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result, src_reg(tmp));
> > > > +      emit(MOV(dst, src_reg(result)));
> > > Couldn't you just do a single CMP instruction of the double-
> > > precision
> > > argument and a single-precision 0.0 immediate? 
> > It never occurred to me that we could do that but it seems to work
> > just
> > fine.
> > 
> > > 
> > >  I think you could
> > > potentially also use a 32bit destination type on the CMP
> > > instruction
> > > so
> > > you don't need to emit the PICK_LOW+MOV instructions afterwards.
> > This does not seem to work though.
> > 
> > > 
> > >   It may
> > > hit an instruction decompression bug but it could be cleaned up
> > > by
> > > the
> > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > uncompressed instructions instead of the two uncompressed plus
> > > two
> > > compressed instructions above.
> > I tried splitting the instruction just in case. Notice that doing
> > this
> > requires that we improve the splitting pass to support splitting
> > instructions that write only one register (the original series did
> > not
> > implement support for that because so far the only cared to split
> > DF
> > instructions that wrote more than one register). I implemented a
> > couple
> > of quick hacks to get this done so I could test this but the result
> > is
> > still incorrect.
> > 
> > For reference, if we don't split, the resulting CMP looks like this
> > (after full scalarization):
> > 
> > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > 
> > And if we split the instruction in two we produce this:
> > 
> > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N };
> > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N };
> > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N };
> > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N };
> > 
> > Both sequences of instructions look correct to me as far as the
> > splitting and DF regioning go, so I guess the 32-bit CMP result is
> > not
> > doing what we want.
> > 
> > I am not too surprised about this. Even if we use a 32b dst type I
> > suppose the instruction execution type is still 64b so I guess
> > there
> > should be a 64b to 32b conversion involved in writing to a 32b
> > result.
> > Seeing how 64b to 32b conversions require that we emit specific
> > code
> > using ALIGN1 mode to deal with alignment requirements I guess it is
> > not
> > too surprising that this does not work, right?
> > 
> Yeah, that's not too surprising, don't worry about it.  Another idea
> would be to do something along the lines of 'MOV.nz null, <df source
> goes here>' to check whether the source is non-zero (so you avoid
> loading the immediate which is a non-trivial operation on Gen7), and
> then use a single-precision SEL instruction (or two predicated MOV
> instructions) to write true or false to the destination (so you don't
> need the PICK_LOW+MOV sequence).

Using MOV.nz seems to have a different behavior than CMP.nz. The former
will consider anything other than 0.0 to be non-zero, whereas the CMP
version will consider small enough DF values to be zero as well, which
I think is the behavior we want (I think I remember discussing this
here when we were implementing this for broadwell).

Fortunately, we can do something like this:

   src_reg one = src_reg(this, glsl_type::ivec4_type);
   emit(MOV(dst_reg(one), brw_imm_d(~0)));

   vec4_instruction *inst =
      emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
BRW_CONDITIONAL_NZ));
   inst->regs_written = 1;

   inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
   inst->predicate = BRW_PREDICATE_NORMAL;

which is pretty much the same idea (notice that we use a float constant
for the CMP) and keeps the original behavior.

If you are curious about the inst->regs_written=1 in the CMP it is just
because the vec4_constructor will figure out that because it has a DF
destination it writes two registers, but it really doesn't because it
is a null dst register. If we leave it at 2, then the simd lowering
pass will split the CMP instruction because the 0.0f constant argument
only reads one register. Maybe we should make vec4_instruction set only
1 register written for null dsts or just make the simd lowering pass
ignore the number of registers written for instructions with a null
dst?

Iago

> > 
> > I can send you a trace if you want to have a look at it.
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > +      break;
> > > > +   }
> > > > +
> > > >     case nir_op_i2b:
> > > >        emit(CMP(dst, op[0], brw_imm_d(0), BRW_CONDITIONAL_NZ));
> > > >        break;


More information about the mesa-dev mailing list