[Mesa-dev] [PATCH 62/95] i965/vec4: Add a shuffle_64bit_data helper

Wed Sep 14 07:34:54 UTC 2016

On Tue, 2016-09-13 at 22:24 -0700, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > 
> > On Mon, 2016-09-12 at 14:19 -0700, Francisco Jerez wrote:
> > > 
> > > Iago Toral Quiroga <itoral at igalia.com> writes:
> > > 
> > > > 
> > > > 
> > > > SIMD4x2 64bit data is stored in register space like this:
> > > > 
> > > > r0.0:DF  x0 y0 z0 w0
> > > > r0.1:DF  x1 y1 z1 w1
> > > > 
> > > > When we need to write data such as this to memory using 32-bit
> > > > write
> > > > messages we need to shuffle it in this fashion:
> > > > 
> > > > r0.0:DF  x0 y0 x1 y1
> > > > r0.1:DF  z0 w0 z1 w1
> > > > 
> > > > and emit two 32-bit write messages, one for r0.0 at base_offset
> > > > and another one for r0.1 at base_offset+16.
> > > > 
> > > > We also need to do the inverse operation when we read using 32-
> > > > bit
> > > > messages
> > > > to produce valid SIMD4x2 64bit data from the data read. We can
> > > > achieve this
> > > > by aplying the exact same shuffling to the data read, although
> > > > we
> > > > need to
> > > > apply different channel enables since the layout of the data is
> > > > reversed.
> > > > 
> > > > This helper implements the data shuffling logic and we will use
> > > > it
> > > > in
> > > > various places where we read and write 64bit data from/to
> > > > memory.
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
> > > >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96
> > > > ++++++++++++++++++++++++++++++
> > > >  2 files changed, 101 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > b/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > index 26228d0..3337fc0 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> > > > @@ -327,6 +327,11 @@ public:
> > > >  
> > > >     src_reg setup_imm_df(double v);
> > > >  
> > > > +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg
> > > > src,
> > > > +                                        bool for_write,
> > > > +                                        bblock_t *block =
> > > > NULL,
> > > > +                                        vec4_instruction *ref
> > > > =
> > > > NULL);
> > > > +
> > > >     virtual void emit_nir_code();
> > > >     virtual void nir_setup_uniforms();
> > > >     virtual void
> > > > nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > index 450db92..346e822 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > > > @@ -2145,4 +2145,100 @@
> > > > vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
> > > >        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
> > > >  }
> > > >  
> > > > +/* SIMD4x2 64bit data is stored in register space like this:
> > > > + *
> > > > + * r0.0:DF  x0 y0 z0 w0
> > > > + * r0.1:DF  x1 y1 z1 w1
> > > > + *
> > > > + * When we need to write data such as this to memory using 32-
> > > > bit
> > > > write
> > > > + * messages we need to shuffle it in this fashion:
> > > > + *
> > > > + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
> > > > + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
> > > > + *
> > > > + * We need to do the inverse operation when we read using 32-
> > > > bit
> > > > messages,
> > > > + * which we can do by applying the same exact shuffling on the
> > > > 64-
> > > > bit data
> > > > + * read, only that because the data for each vertex is
> > > > positioned
> > > > differently
> > > > + * we need to apply different channel enables.
> > > > + *
> > > > + * This function takes 64bit data and shuffles it as explained
> > > > above.
> > > > + *
> > > > + * The @for_write parameter is used to specify if the
> > > > shuffling is
> > > > being done
> > > > + * for proper SIMD4x2 64-bit data that needs to be shuffled
> > > > prior
> > > > to a 32-bit
> > > > + * write message (for_write = true), or instead we are doing
> > > > the
> > > > inverse
> > > > + * opperation and we have just read 64-bit data using a 32-bit
> > > > messages that we
> > > > + * need to shuffle to create valid SIMD4x2 64-bit data
> > > > (for_write
> > > > = false).
> > > > + *
> > > > + * If @block and @ref are non-NULL, then the shuffling is done
> > > > after @ref,
> > > > + * otherwise the instructions are emitted normally at the end.
> > > > The
> > > > function
> > > > + * returns the last instruction inserted.
> > > > + *
> > > > + * Notice that @src and @dst cannot be the same register.
> > > > + */
> > > > +vec4_instruction *
> > > > +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src,
> > > > bool
> > > > for_write,
> > > > +                                 bblock_t *block,
> > > > vec4_instruction
> > > > *ref)
> > > > +{
> > > > +   assert(type_sz(src.type) == 8);
> > > > +   assert(type_sz(dst.type) == 8);
> > > > +   assert(!src.in_range(dst, 2));
> > > > +   assert(dst.writemask == WRITEMASK_XYZW);
> > > > +   assert(!ref == !block);
> > > > +
> > > > +   vec4_instruction *inst, *last;
> > > > +   bool emit_before = ref != NULL;
> > > > +
> > > > +   #define EMIT(i) \
> > > > +      if (!emit_before) { \
> > > > +         emit(i); \
> > > > +      } else { \
> > > > +         ref->insert_after(block, i); \
> > > > +         ref = i; \
> > > > +      }  \
> > > > +      last = i;
> > > > +
> > > > +   /* Resolve swizzle in src */
> > > > +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
> > > > +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
> > > > +      inst = MOV(data, src);
> > > > +      EMIT(inst);
> > > > +      src = src_reg(data);
> > > > +   }
> > > > +
> > > > +   /* dst+0.XY = src+0.XY */
> > > > +   dst.writemask = WRITEMASK_XY;
> > > > +   inst = MOV(dst, src);
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+0.ZW = src+1.XY */
> > > > +   dst.writemask = WRITEMASK_ZW;
> > > > +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = for_write ? 4 : 0;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+1.XY = src+0.ZW */
> > > > +   dst.writemask = WRITEMASK_XY;
> > > > +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = for_write ? 0 : 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   /* dst+1.ZW = src+1.ZW */
> > > > +   dst.writemask = WRITEMASK_ZW;
> > > > +   inst = MOV(offset(dst, 1), offset(src, 1));
> > > > +   inst->regs_written = 1;
> > > > +   inst->exec_size = 4;
> > > > +   inst->group = 4;
> > > > +   EMIT(inst);
> > > > +
> > > > +   #undef EMIT
> > > > +
> > > > +   return last;
> > > > +}
> > > > +
> > > This would be a *lot* simpler if you used and took as argument a
> > > VEC4
> > > builder instead of constructing the IR manually (and if you used
> > > the
> > > writemask() helper).  It might need some small fixes for the VEC4
> > > builder to initialize the new group and exec_size fields
> > > correctly
> > > but
> > > it seems worth doing just because of this function alone.
> > Sure, I can do that. About using the writemask() helper, the
> > problem
> > with it is that it doesn't set the input writemask on the dst,
> > instead
> > it sets the result of "dst.writemask & mask", and when we are
> > shuffling
> > we really want to force the writemask independently of any previous
> > writemask the dst register had.
> Why?  If I give a register r with writemask XY to a function 'foo', I
> wouldn't expect it to ignore the writemask I told it to use and
> instead
> use a different writemask not contained in the original
> writemask.  If
> you're doing that right now you likely have a bug.

Shuffled data is always written to a new vgrf because (so there is
never data to preserve in the dst) and it is only used specifically
right before a write where we honor the writemask on the dst of the
write operation or right after a read (where we the swizzle on the
source takes care of reading the data channels it needs from the
shuffled result), so I think this is probably fine.

> > 
> > Probably the dst we get here has an XYZW writemask anyway since it
> > would typically be a temporary vgrf we have just created to store
> > the
> > result of the shuffling operation. Maybe we should just add an
> > assert
> > to ensure that, and then we should be free to use the writemask()
> > helper.
> > 
> Looks like you have one already:
> 
> +   assert(dst.writemask == WRITEMASK_XYZW);
> 
> Question is do you need it only because you're bashing the writemasks
> directly instead of using writemask(), or is there any other reason
> for
> the assert?

I don't think we really need this as a requirement and we can honor any
writemask we get. That said, I don't think that would be immediately
useful in any scenario, at least right now.

As a side note, notice that using writemasks with shuffling operations
requires to think a bit about what we are really doing in each case: if
we use an XY writemask when we shuffle before a write, we are shuffling
XYZW components of just one (the first) of the SIMD4x2 threads. If we
do the same thing when we shuffle the result of a read then we are
writing components XY of both threads.

Either way, if some code path really wants to shuffle with a writemask
set and it knows what it is doing I guess there is no reason to stand
in the way.

> > 
> > > 
> > > > 
> > > > 
> > > >  }