[Mesa-dev] [PATCH 33/95] i965/vec4: implement d2b

Fri Aug 19 06:50:42 UTC 2016

On Thu, 2016-08-18 at 12:08 -0700, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > 
> > On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral <itoral at igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > > > 
> > > > > 
> > > > > Iago Toral Quiroga <itoral at igalia.com> writes:
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ---
> > > > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > > > ++++++++++++++++++
> > > > > >  1 file changed, 18 insertions(+)
> > > > > > 
> > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > index 1525a3d..4014020 100644
> > > > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > @@ -1497,6 +1497,24 @@
> > > > > > vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > > > *instr)
> > > > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > >        break;
> > > > > >  
> > > > > > +   case nir_op_d2b: {
> > > > > > +      /* two-argument instructions can't take 64-bit
> > > > > > immediates */
> > > > > > +      dst_reg zero = dst_reg(this, glsl_type::dvec4_type);
> > > > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > > > +
> > > > > > +      dst_reg tmp = dst_reg(this, glsl_type::dvec4_type);
> > > > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > > +
> > > > > > +      /* Convert the double CMP result to a single boolean
> > > > > > result.
> > > > > > For that
> > > > > > +       * we take the low 32-bit chunk of each DF component
> > > > > > in
> > > > > > the
> > > > > > result.
> > > > > > +       * and do a final MOV to honor the original
> > > > > > writemask
> > > > > > +       */
> > > > > > +      dst_reg result = dst_reg(this,
> > > > > > glsl_type::bvec4_type);
> > > > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result,
> > > > > > src_reg(tmp));
> > > > > > +      emit(MOV(dst, src_reg(result)));
> > > > > Couldn't you just do a single CMP instruction of the double-
> > > > > precision
> > > > > argument and a single-precision 0.0 immediate? 
> > > > It never occurred to me that we could do that but it seems to
> > > > work
> > > > just
> > > > fine.
> > > > 
> > > > > 
> > > > > 
> > > > >  I think you could
> > > > > potentially also use a 32bit destination type on the CMP
> > > > > instruction
> > > > > so
> > > > > you don't need to emit the PICK_LOW+MOV instructions
> > > > > afterwards.
> > > > This does not seem to work though.
> > > > 
> > > > > 
> > > > > 
> > > > >   It may
> > > > > hit an instruction decompression bug but it could be cleaned
> > > > > up
> > > > > by
> > > > > the
> > > > > SIMD lowering pass afterwards, AFAICT the result would be two
> > > > > uncompressed instructions instead of the two uncompressed
> > > > > plus
> > > > > two
> > > > > compressed instructions above.
> > > > I tried splitting the instruction just in case. Notice that
> > > > doing
> > > > this
> > > > requires that we improve the splitting pass to support
> > > > splitting
> > > > instructions that write only one register (the original series
> > > > did
> > > > not
> > > > implement support for that because so far the only cared to
> > > > split
> > > > DF
> > > > instructions that wrote more than one register). I implemented
> > > > a
> > > > couple
> > > > of quick hacks to get this done so I could test this but the
> > > > result
> > > > is
> > > > still incorrect.
> > > > 
> > > > For reference, if we don't split, the resulting CMP looks like
> > > > this
> > > > (after full scalarization):
> > > > 
> > > > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q };
> > > > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q };
> > > > 
> > > > And if we split the instruction in two we produce this:
> > > > 
> > > > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16 1N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16 2N
> > > > };
> > > > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16 2N
> > > > };
> > > > 
> > > > Both sequences of instructions look correct to me as far as the
> > > > splitting and DF regioning go, so I guess the 32-bit CMP result
> > > > is
> > > > not
> > > > doing what we want.
> > > > 
> > > > I am not too surprised about this. Even if we use a 32b dst
> > > > type I
> > > > suppose the instruction execution type is still 64b so I guess
> > > > there
> > > > should be a 64b to 32b conversion involved in writing to a 32b
> > > > result.
> > > > Seeing how 64b to 32b conversions require that we emit specific
> > > > code
> > > > using ALIGN1 mode to deal with alignment requirements I guess
> > > > it is
> > > > not
> > > > too surprising that this does not work, right?
> > > > 
> > > Yeah, that's not too surprising, don't worry about it.  Another
> > > idea
> > > would be to do something along the lines of 'MOV.nz null, <df
> > > source
> > > goes here>' to check whether the source is non-zero (so you avoid
> > > loading the immediate which is a non-trivial operation on Gen7),
> > > and
> > > then use a single-precision SEL instruction (or two predicated
> > > MOV
> > > instructions) to write true or false to the destination (so you
> > > don't
> > > need the PICK_LOW+MOV sequence).
> > Using MOV.nz seems to have a different behavior than CMP.nz. The
> > former
> > will consider anything other than 0.0 to be non-zero, whereas the
> > CMP
> > version will consider small enough DF values to be zero as well,
> > which
> > I think is the behavior we want (I think I remember discussing this
> > here when we were implementing this for broadwell).
> > 
> What do you mean by small enough?  Denormalized? 

Yes, denormalized values, sorry I wasn't more clear.

>  MOV instructions can
> flush them to zero as well, only raw moves won't, you can make sure
> that
> the MOV is not raw by applying a source modifier to the source which
> shouldn't affect the flag result for normalized sources.

Oh, didn't know that, I can try this.

> > 
> > Fortunately, we can do something like this:
> > 
> >    src_reg one = src_reg(this, glsl_type::ivec4_type);
> >    emit(MOV(dst_reg(one), brw_imm_d(~0)));
> > 
> >    vec4_instruction *inst =
> >       emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
> > BRW_CONDITIONAL_NZ));
> >    inst->regs_written = 1;
> > 
> >    inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
> >    inst->predicate = BRW_PREDICATE_NORMAL;
> > 
> > which is pretty much the same idea (notice that we use a float
> > constant
> > for the CMP) and keeps the original behavior.
> > 
> Hm, I thought you said that using an F immediate on a double-
> precision
> instruction doesn't work?

No, that seems to work fine, at least if the float constant is 0.0 it
seems to produce correct results in all the cases I tested. What
doesn't work is the part where we also make the CMP produce a 32-bit
result.

> > 
> > If you are curious about the inst->regs_written=1 in the CMP it is
> > just because the vec4_constructor will figure out that because it
> > has
> > a DF destination it writes two registers, but it really doesn't
> > because it is a null dst register.  If we leave it at 2, then the
> > simd
> > lowering pass will split the CMP instruction because the 0.0f
> > constant
> > argument only reads one register.
> That seems like a bug.  There is no need to lower an instruction with
> a
> mismatch in the number of registers read and written if the source is
> uniform:
> 
> > 
> >  When destination spans two registers, the source MUST span two
> >  registers. The exception to the above rule:
> > 
> >   - When source is scalar, the source registers are not
> > incremented.
> >   - When source is packed integer Word and destination is
> >     packed integer DWord, the source register is not incremented
> >     but the source sub register is incremented.
> The FS back-end implements both exceptions, but there's probably no
> need
> to implement the second one in the VEC4 back-end.
> 
> > 
> > Maybe we should make vec4_instruction set only 1 register
> > written for null dsts or just make the simd lowering pass ignore
> > the
> > number of registers written for instructions with a null dst?
> > 
> > Iago
> > 
> > > 
> > > > 
> > > > 
> > > > I can send you a trace if you want to have a look at it.
> > > > 
> > > > Iago
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > +      break;
> > > > > > +   }
> > > > > > +
> > > > > >     case nir_op_i2b:
> > > > > >        emit(CMP(dst, op[0], brw_imm_d(0),
> > > > > > BRW_CONDITIONAL_NZ));
> > > > > >        break;