[Mesa-dev] [PATCH 62/95] i965/vec4: Add a shuffle_64bit_data helper

Tue Sep 13 07:38:56 UTC 2016

On Mon, 2016-09-12 at 14:19 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga <itoral at igalia.com> writes:
> 
> > 
> > SIMD4x2 64bit data is stored in register space like this:
> > 
> > r0.0:DF  x0 y0 z0 w0
> > r0.1:DF  x1 y1 z1 w1
> > 
> > When we need to write data such as this to memory using 32-bit
> > write
> > messages we need to shuffle it in this fashion:
> > 
> > r0.0:DF  x0 y0 x1 y1
> > r0.1:DF  z0 w0 z1 w1
> > 
> > and emit two 32-bit write messages, one for r0.0 at base_offset
> > and another one for r0.1 at base_offset+16.
> > 
> > We also need to do the inverse operation when we read using 32-bit
> > messages
> > to produce valid SIMD4x2 64bit data from the data read. We can
> > achieve this
> > by aplying the exact same shuffling to the data read, although we
> > need to
> > apply different channel enables since the layout of the data is
> > reversed.
> > 
> > This helper implements the data shuffling logic and we will use it
> > in
> > various places where we read and write 64bit data from/to memory.
> > ---
> >  src/mesa/drivers/dri/i965/brw_vec4.h       |  5 ++
> >  src/mesa/drivers/dri/i965/brw_vec4_nir.cpp | 96
> > ++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4.h
> > b/src/mesa/drivers/dri/i965/brw_vec4.h
> > index 26228d0..3337fc0 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4.h
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4.h
> > @@ -327,6 +327,11 @@ public:
> >  
> >     src_reg setup_imm_df(double v);
> >  
> > +   vec4_instruction *shuffle_64bit_data(dst_reg dst, src_reg src,
> > +                                        bool for_write,
> > +                                        bblock_t *block = NULL,
> > +                                        vec4_instruction *ref =
> > NULL);
> > +
> >     virtual void emit_nir_code();
> >     virtual void nir_setup_uniforms();
> >     virtual void
> > nir_setup_system_value_intrinsic(nir_intrinsic_instr *instr);
> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > index 450db92..346e822 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_nir.cpp
> > @@ -2145,4 +2145,100 @@
> > vec4_visitor::nir_emit_undef(nir_ssa_undef_instr *instr)
> >        dst_reg(VGRF, alloc.allocate(instr->def.bit_size / 32));
> >  }
> >  
> > +/* SIMD4x2 64bit data is stored in register space like this:
> > + *
> > + * r0.0:DF  x0 y0 z0 w0
> > + * r0.1:DF  x1 y1 z1 w1
> > + *
> > + * When we need to write data such as this to memory using 32-bit
> > write
> > + * messages we need to shuffle it in this fashion:
> > + *
> > + * r0.0:DF  x0 y0 x1 y1 (to be written at base offset)
> > + * r0.0:DF  z0 w0 z1 w1 (to be written at base offset + 16)
> > + *
> > + * We need to do the inverse operation when we read using 32-bit
> > messages,
> > + * which we can do by applying the same exact shuffling on the 64-
> > bit data
> > + * read, only that because the data for each vertex is positioned
> > differently
> > + * we need to apply different channel enables.
> > + *
> > + * This function takes 64bit data and shuffles it as explained
> > above.
> > + *
> > + * The @for_write parameter is used to specify if the shuffling is
> > being done
> > + * for proper SIMD4x2 64-bit data that needs to be shuffled prior
> > to a 32-bit
> > + * write message (for_write = true), or instead we are doing the
> > inverse
> > + * opperation and we have just read 64-bit data using a 32-bit
> > messages that we
> > + * need to shuffle to create valid SIMD4x2 64-bit data (for_write
> > = false).
> > + *
> > + * If @block and @ref are non-NULL, then the shuffling is done
> > after @ref,
> > + * otherwise the instructions are emitted normally at the end. The
> > function
> > + * returns the last instruction inserted.
> > + *
> > + * Notice that @src and @dst cannot be the same register.
> > + */
> > +vec4_instruction *
> > +vec4_visitor::shuffle_64bit_data(dst_reg dst, src_reg src, bool
> > for_write,
> > +                                 bblock_t *block, vec4_instruction
> > *ref)
> > +{
> > +   assert(type_sz(src.type) == 8);
> > +   assert(type_sz(dst.type) == 8);
> > +   assert(!src.in_range(dst, 2));
> > +   assert(dst.writemask == WRITEMASK_XYZW);
> > +   assert(!ref == !block);
> > +
> > +   vec4_instruction *inst, *last;
> > +   bool emit_before = ref != NULL;
> > +
> > +   #define EMIT(i) \
> > +      if (!emit_before) { \
> > +         emit(i); \
> > +      } else { \
> > +         ref->insert_after(block, i); \
> > +         ref = i; \
> > +      }  \
> > +      last = i;
> > +
> > +   /* Resolve swizzle in src */
> > +   if (src.swizzle != BRW_SWIZZLE_XYZW) {
> > +      dst_reg data = dst_reg(this, glsl_type::dvec4_type);
> > +      inst = MOV(data, src);
> > +      EMIT(inst);
> > +      src = src_reg(data);
> > +   }
> > +
> > +   /* dst+0.XY = src+0.XY */
> > +   dst.writemask = WRITEMASK_XY;
> > +   inst = MOV(dst, src);
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   EMIT(inst);
> > +
> > +   /* dst+0.ZW = src+1.XY */
> > +   dst.writemask = WRITEMASK_ZW;
> > +   inst = MOV(dst, swizzle(offset(src, 1), BRW_SWIZZLE_XYXY));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = for_write ? 4 : 0;
> > +   EMIT(inst);
> > +
> > +   /* dst+1.XY = src+0.ZW */
> > +   dst.writemask = WRITEMASK_XY;
> > +   inst = MOV(offset(dst, 1), swizzle(src, BRW_SWIZZLE_ZWZW));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = for_write ? 0 : 4;
> > +   EMIT(inst);
> > +
> > +   /* dst+1.ZW = src+1.ZW */
> > +   dst.writemask = WRITEMASK_ZW;
> > +   inst = MOV(offset(dst, 1), offset(src, 1));
> > +   inst->regs_written = 1;
> > +   inst->exec_size = 4;
> > +   inst->group = 4;
> > +   EMIT(inst);
> > +
> > +   #undef EMIT
> > +
> > +   return last;
> > +}
> > +
> This would be a *lot* simpler if you used and took as argument a VEC4
> builder instead of constructing the IR manually (and if you used the
> writemask() helper).  It might need some small fixes for the VEC4
> builder to initialize the new group and exec_size fields correctly
> but
> it seems worth doing just because of this function alone.

Sure, I can do that. About using the writemask() helper, the problem
with it is that it doesn't set the input writemask on the dst, instead
it sets the result of "dst.writemask & mask", and when we are shuffling
we really want to force the writemask independently of any previous
writemask the dst register had. Probably the dst we get here has an
XYZW writemask anyway since it would typically be a temporary vgrf we
have just created to store the result of the shuffling operation. Maybe
we should just add an assert to ensure that, and then we should be free
to use the writemask() helper.

> > 
> >  }