[Mesa-dev] [PATCH 33/95] i965/vec4: implement d2b

Fri Aug 19 07:13:32 UTC 2016

On Fri, 2016-08-19 at 08:50 +0200, Iago Toral wrote:
> On Thu, 2016-08-18 at 12:08 -0700, Francisco Jerez wrote:
> > 
> > Iago Toral <itoral at igalia.com> writes:
> > 
> > > 
> > > 
> > > On Wed, 2016-08-17 at 15:15 -0700, Francisco Jerez wrote:
> > > > 
> > > > 
> > > > Iago Toral <itoral at igalia.com> writes:
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > On Tue, 2016-08-02 at 18:27 -0700, Francisco Jerez wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Iago Toral Quiroga <itoral at igalia.com> writes:
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > ---
> > > > > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 18
> > > > > > > ++++++++++++++++++
> > > > > > >  1 file changed, 18 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > index 1525a3d..4014020 100644
> > > > > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > > > > @@ -1497,6 +1497,24 @@
> > > > > > > vec4_visitor::nir_emit_alu(nir_alu_instr
> > > > > > > *instr)
> > > > > > >        emit(CMP(dst, op[0], brw_imm_f(0.0f),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > >        break;
> > > > > > >  
> > > > > > > +   case nir_op_d2b: {
> > > > > > > +      /* two-argument instructions can't take 64-bit
> > > > > > > immediates */
> > > > > > > +      dst_reg zero = dst_reg(this,
> > > > > > > glsl_type::dvec4_type);
> > > > > > > +      emit(MOV(zero, brw_imm_df(0.0)));
> > > > > > > +
> > > > > > > +      dst_reg tmp = dst_reg(this,
> > > > > > > glsl_type::dvec4_type);
> > > > > > > +      emit(CMP(tmp, op[0], src_reg(zero),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > > +
> > > > > > > +      /* Convert the double CMP result to a single
> > > > > > > boolean
> > > > > > > result.
> > > > > > > For that
> > > > > > > +       * we take the low 32-bit chunk of each DF
> > > > > > > component
> > > > > > > in
> > > > > > > the
> > > > > > > result.
> > > > > > > +       * and do a final MOV to honor the original
> > > > > > > writemask
> > > > > > > +       */
> > > > > > > +      dst_reg result = dst_reg(this,
> > > > > > > glsl_type::bvec4_type);
> > > > > > > +      emit(VEC4_OPCODE_PICK_LOW_32BIT, result,
> > > > > > > src_reg(tmp));
> > > > > > > +      emit(MOV(dst, src_reg(result)));
> > > > > > Couldn't you just do a single CMP instruction of the
> > > > > > double-
> > > > > > precision
> > > > > > argument and a single-precision 0.0 immediate? 
> > > > > It never occurred to me that we could do that but it seems to
> > > > > work
> > > > > just
> > > > > fine.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  I think you could
> > > > > > potentially also use a 32bit destination type on the CMP
> > > > > > instruction
> > > > > > so
> > > > > > you don't need to emit the PICK_LOW+MOV instructions
> > > > > > afterwards.
> > > > > This does not seem to work though.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >   It may
> > > > > > hit an instruction decompression bug but it could be
> > > > > > cleaned
> > > > > > up
> > > > > > by
> > > > > > the
> > > > > > SIMD lowering pass afterwards, AFAICT the result would be
> > > > > > two
> > > > > > uncompressed instructions instead of the two uncompressed
> > > > > > plus
> > > > > > two
> > > > > > compressed instructions above.
> > > > > I tried splitting the instruction just in case. Notice that
> > > > > doing
> > > > > this
> > > > > requires that we improve the splitting pass to support
> > > > > splitting
> > > > > instructions that write only one register (the original
> > > > > series
> > > > > did
> > > > > not
> > > > > implement support for that because so far the only cared to
> > > > > split
> > > > > DF
> > > > > instructions that wrote more than one register). I
> > > > > implemented
> > > > > a
> > > > > couple
> > > > > of quick hacks to get this done so I could test this but the
> > > > > result
> > > > > is
> > > > > still incorrect.
> > > > > 
> > > > > For reference, if we don't split, the resulting CMP looks
> > > > > like
> > > > > this
> > > > > (after full scalarization):
> > > > > 
> > > > > cmp.nz.f0(8)  g8<1>.xD  g6<2,2,1>.xyxyDF 0F     { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.yD  g6<2,2,1>.zwzwDF 0F     { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.zD  g6.2<0,2,1>.xyxyDF 0F   { align16 1Q
> > > > > };
> > > > > cmp.nz.f0(8)  g8<1>.wD  g6.2<0,2,1>.zwzwDF 0F   { align16 1Q
> > > > > };
> > > > > 
> > > > > And if we split the instruction in two we produce this:
> > > > > 
> > > > > cmp.nz.f0(4)  g6<1>.xD     g1<0,2,1>.xyxyDF 0F    { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.yD     g1<0,2,1>.zwzwDF 0F    { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.zD     g1.2<0,2,1>.xyxyDF 0F  { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6<1>.wD     g1.2<0,2,1>.zwzwDF 0F  { align16
> > > > > 1N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.xD   g1<0,2,1>.xyxyDF 0F    { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.yD   g1<0,2,1>.zwzwDF 0F    { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.zD   g1.2<0,2,1>.xyxyDF 0F  { align16
> > > > > 2N
> > > > > };
> > > > > cmp.nz.f0(4)  g6.4<1>.wD   g1.2<0,2,1>.zwzwDF 0F  { align16
> > > > > 2N
> > > > > };
> > > > > 
> > > > > Both sequences of instructions look correct to me as far as
> > > > > the
> > > > > splitting and DF regioning go, so I guess the 32-bit CMP
> > > > > result
> > > > > is
> > > > > not
> > > > > doing what we want.
> > > > > 
> > > > > I am not too surprised about this. Even if we use a 32b dst
> > > > > type I
> > > > > suppose the instruction execution type is still 64b so I
> > > > > guess
> > > > > there
> > > > > should be a 64b to 32b conversion involved in writing to a
> > > > > 32b
> > > > > result.
> > > > > Seeing how 64b to 32b conversions require that we emit
> > > > > specific
> > > > > code
> > > > > using ALIGN1 mode to deal with alignment requirements I guess
> > > > > it is
> > > > > not
> > > > > too surprising that this does not work, right?
> > > > > 
> > > > Yeah, that's not too surprising, don't worry about it.  Another
> > > > idea
> > > > would be to do something along the lines of 'MOV.nz null, <df
> > > > source
> > > > goes here>' to check whether the source is non-zero (so you
> > > > avoid
> > > > loading the immediate which is a non-trivial operation on
> > > > Gen7),
> > > > and
> > > > then use a single-precision SEL instruction (or two predicated
> > > > MOV
> > > > instructions) to write true or false to the destination (so you
> > > > don't
> > > > need the PICK_LOW+MOV sequence).
> > > Using MOV.nz seems to have a different behavior than CMP.nz. The
> > > former
> > > will consider anything other than 0.0 to be non-zero, whereas the
> > > CMP
> > > version will consider small enough DF values to be zero as well,
> > > which
> > > I think is the behavior we want (I think I remember discussing
> > > this
> > > here when we were implementing this for broadwell).
> > > 
> > What do you mean by small enough?  Denormalized? 
> Yes, denormalized values, sorry I wasn't more clear.
> 
> > 
> >  MOV instructions can
> > flush them to zero as well, only raw moves won't, you can make sure
> > that
> > the MOV is not raw by applying a source modifier to the source
> > which
> > shouldn't affect the flag result for normalized sources.
> Oh, didn't know that, I can try this.

This works. I'll go with this approach instead of the CMP with float
constant solution since we can explain exactly why this solution works.

Iago

> > 
> > > 
> > > 
> > > Fortunately, we can do something like this:
> > > 
> > >    src_reg one = src_reg(this, glsl_type::ivec4_type);
> > >    emit(MOV(dst_reg(one), brw_imm_d(~0)));
> > > 
> > >    vec4_instruction *inst =
> > >       emit(CMP(dst_null_df(), op[0], brw_imm_f(0.0f),
> > > BRW_CONDITIONAL_NZ));
> > >    inst->regs_written = 1;
> > > 
> > >    inst = emit(BRW_OPCODE_SEL, dst, one, brw_imm_d(0));
> > >    inst->predicate = BRW_PREDICATE_NORMAL;
> > > 
> > > which is pretty much the same idea (notice that we use a float
> > > constant
> > > for the CMP) and keeps the original behavior.
> > > 
> > Hm, I thought you said that using an F immediate on a double-
> > precision
> > instruction doesn't work?
> No, that seems to work fine, at least if the float constant is 0.0 it
> seems to produce correct results in all the cases I tested. What
> doesn't work is the part where we also make the CMP produce a 32-bit
> result.
> 
> > 
> > > 
> > > 
> > > If you are curious about the inst->regs_written=1 in the CMP it
> > > is
> > > just because the vec4_constructor will figure out that because it
> > > has
> > > a DF destination it writes two registers, but it really doesn't
> > > because it is a null dst register.  If we leave it at 2, then the
> > > simd
> > > lowering pass will split the CMP instruction because the 0.0f
> > > constant
> > > argument only reads one register.
> > That seems like a bug.  There is no need to lower an instruction
> > with
> > a
> > mismatch in the number of registers read and written if the source
> > is
> > uniform:
> > 
> > > 
> > > 
> > >  When destination spans two registers, the source MUST span two
> > >  registers. The exception to the above rule:
> > > 
> > >   - When source is scalar, the source registers are not
> > > incremented.
> > >   - When source is packed integer Word and destination is
> > >     packed integer DWord, the source register is not incremented
> > >     but the source sub register is incremented.
> > The FS back-end implements both exceptions, but there's probably no
> > need
> > to implement the second one in the VEC4 back-end.
> > 
> > > 
> > > 
> > > Maybe we should make vec4_instruction set only 1 register
> > > written for null dsts or just make the simd lowering pass ignore
> > > the
> > > number of registers written for instructions with a null dst?
> > > 
> > > Iago
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > I can send you a trace if you want to have a look at it.
> > > > > 
> > > > > Iago
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > +      break;
> > > > > > > +   }
> > > > > > > +
> > > > > > >     case nir_op_i2b:
> > > > > > >        emit(CMP(dst, op[0], brw_imm_d(0),
> > > > > > > BRW_CONDITIONAL_NZ));
> > > > > > >        break;