[Mesa-dev] [PATCH 2/6] i965/vec4/generator: use 1-Oword Block Read/Write messages for DF scratch writes/reads

Mon Jun 26 14:37:50 UTC 2017

On Fri, 2017-06-23 at 11:06 -0700, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
> > On Thu, 2017-06-22 at 16:25 -0700, Francisco Jerez wrote:
> > > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> > > 
> > > > Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
> > > > ---
> > > >  src/intel/compiler/brw_eu_defines.h          |   2 +
> > > >  src/intel/compiler/brw_shader.cpp            |   5 +
> > > >  src/intel/compiler/brw_vec4.cpp              |   7 ++
> > > >  src/intel/compiler/brw_vec4.h                |   8 ++
> > > >  src/intel/compiler/brw_vec4_generator.cpp    | 136
> > > > +++++++++++++++++++++++++++
> > > >  src/intel/compiler/brw_vec4_reg_allocate.cpp |   6 +-
> > > >  src/intel/compiler/brw_vec4_visitor.cpp      |  49 ++++++++++
> > > >  7 files changed, 212 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/src/intel/compiler/brw_eu_defines.h
> > > > b/src/intel/compiler/brw_eu_defines.h
> > > > index 1af835d47e..3c148de0fa 100644
> > > > --- a/src/intel/compiler/brw_eu_defines.h
> > > > +++ b/src/intel/compiler/brw_eu_defines.h
> > > > @@ -436,6 +436,8 @@ enum opcode {
> > > >     VEC4_OPCODE_PICK_HIGH_32BIT,
> > > >     VEC4_OPCODE_SET_LOW_32BIT,
> > > >     VEC4_OPCODE_SET_HIGH_32BIT,
> > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW,
> > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH,
> > > >  
> > > 
> > > What's the point of introducing two different opcodes with
> > > essentially
> > > the same semantics (read 32B worth of data) as the current
> > > SHADER_OPCODE_GEN4_SCRATCH_READ?
> > 
> > Originally I had only SHADER_OPCODE_GEN4_SCRATCH_READ but I changed
> > it
> > to don't allocate more registers than needed when doing scratch
> > write
> > of a partial DF write. Let me explain it:
> > 
> > When doing spilling, as DF instructions are both split and
> > scalarized,
> > we read the existing contents in scratch memory, overwrite them
> > with
> > the destination of the instruction, then emit scratch write.
> > Together
> > with the fact that I am not shuffling DF data, we only need to
> > allocate
> > 1 GRF to do so, instead of 2 (if I had emitted
> > SHADER_OPCODE_GEN4_SCRATCH_READ), when doing spilling on partial DF
> > writes.
> > 
> 
> Why would you need to allocate more GRFs for
> SHADER_OPCODE_GEN4_SCRATCH_READ?  It also only reads one register,
> which
> should be sufficient for a single scalarized instruction as long as
> you
> don't shuffle data around -- Have a look at how the FS back-end
> addresses this problem.
> 

OK

> > >   Is there any downside from using the
> > > current opcode with force_writemask_all?  If anything it would
> > > give
> > > you
> > > better performance because you'd only have to set up one header
> > > (which
> > > stalls the EU pipeline twice), send down one message to the
> > > dataport,
> > > and avoid stalling to shuffle the data around in the return
> > > payload
> > > (which prevents your two 1OWORD messages from being pipelined at
> > > all).
> > > 
> > 
> > Sorry, I am confused here. Do you mean using
> > SHADER_OPCODE_GEN4_SCRATCH_READ as-is, which emits a "OWord Dual
> > Block
> > Read" message (so only one message)?
> > 
> > If that's the case, then I should shuffle the destination data of
> > the
> > partial DF write, change the 1-Oword block write offsets and so
> > on...
> 
> Why would you need to shuffle any spilled data?  I don't think
> there's
> much of a benefit from shuffling since scratch overwrites need read
> the
> original data for the most part anyway because of writemasking.  In
> fact
> shuffling DF data is probably the reason things blow up right now
> whenever you have mixed DF and single-precision reads or writes to
> the
> same spilled variable, which I guess is the reason you need to look
> for
> those cases and mark them as no_spill...
> 

Right, I don't need to shuffle data for the scratch write.

> > in order to save it inside scratch memory in the proper place to
> > make
> > OWord Dual Block Read work. That would require to some extra
> > instructions, but I don't know if this would give better
> > performance
> > against current implementation or not.
> > 
> 
> I expect the most serious performance issue with the approach of this
> patch will be the sequence of non-pipelined single-oword reads, which
> means you get to pay for the EU-dataport roundtrip latency twice
> instead
> of once.
> 
> > Then, why do I need force_writemask=true when emitting
> > SHADER_OPCODE_GEN4_SCRATCH_READ?
> > 
> 
> Because you probably don't want to shuffle data in your scratch
> buffer,
> and you don't want the dataport to apply bogus 16B channel enables to
> your reads and writes.
> 

If we save the dvec4 data of a vertex altogether in consecutive 32
bytes in scratch memory (i.e. no need of shuffling and we use
force_writemask_all as you said), then we need to create a special case
for IVB and partial DFs reads on HSW+ when unspilling the data.

What I am thinking now is if the scratch write is done wisely, we can
write the data in the proper places for the two
SHADER_OPCODE_GEN4_SCRATCH_READ we use for unspill DF data: write each
XY components with the respective 1-OWord scratch write message and ZW
components with other 1-OWord scratch write messages with an offset of
32 bytes. Thanks to this, we don't need to touch the current code for
unspilling (which does data shuffling) and it allows us to do
unspilling on IVB and partial DF reads on HSW+ without any special
case.

If we choose the no-shuffling-at-all solution, this is an improvement
to what I have sent in this v1, but I am leaning toward the solution in
last paragraph because it re-uses existing code and simplifies the
changes, although we have some data shuffling overhead.

What do you think?

Sam

> > I can try this alternative solution if this is what you meant. It
> > has
> > the advantage of simplifying the changes a lot, which is always
> > great.
> > 
> > Sam
> > 
> > > >     FS_OPCODE_DDX_COARSE,
> > > >     FS_OPCODE_DDX_FINE,
> > > > diff --git a/src/intel/compiler/brw_shader.cpp
> > > > b/src/intel/compiler/brw_shader.cpp
> > > > index 53d0742d2e..248feacbd2 100644
> > > > --- a/src/intel/compiler/brw_shader.cpp
> > > > +++ b/src/intel/compiler/brw_shader.cpp
> > > > @@ -296,6 +296,11 @@ brw_instruction_name(const struct
> > > > gen_device_info *devinfo, enum opcode op)
> > > >     case FS_OPCODE_PACK:
> > > >        return "pack";
> > > >  
> > > > +
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +      return "gen4_scratch_read_1word_low";
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > +      return "gen4_scratch_read_1word_high";
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >        return "gen4_scratch_read";
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > diff --git a/src/intel/compiler/brw_vec4.cpp
> > > > b/src/intel/compiler/brw_vec4.cpp
> > > > index b443effca9..b6d409eea2 100644
> > > > --- a/src/intel/compiler/brw_vec4.cpp
> > > > +++ b/src/intel/compiler/brw_vec4.cpp
> > > > @@ -259,6 +259,8 @@ bool
> > > >  vec4_instruction::can_do_writemask(const struct
> > > > gen_device_info
> > > > *devinfo)
> > > >  {
> > > >     switch (opcode) {
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >     case VEC4_OPCODE_DOUBLE_TO_F32:
> > > >     case VEC4_OPCODE_DOUBLE_TO_D32:
> > > > @@ -335,6 +337,9 @@
> > > > vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
> > > >        return 1;
> > > >     case VS_OPCODE_PULL_CONSTANT_LOAD:
> > > >        return 2;
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > +      return 1;
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >        return 2;
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > @@ -2091,6 +2096,8 @@ get_lowered_simd_width(const struct
> > > > gen_device_info *devinfo,
> > > >  {
> > > >     /* Do not split some instructions that require special
> > > > handling
> > > > */
> > > >     switch (inst->opcode) {
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > >        return inst->exec_size;
> > > > diff --git a/src/intel/compiler/brw_vec4.h
> > > > b/src/intel/compiler/brw_vec4.h
> > > > index d828da02ea..a5b45aca21 100644
> > > > --- a/src/intel/compiler/brw_vec4.h
> > > > +++ b/src/intel/compiler/brw_vec4.h
> > > > @@ -214,6 +214,9 @@ public:
> > > >                          enum brw_conditional_mod condition);
> > > >     vec4_instruction *IF(enum brw_predicate predicate);
> > > >     EMIT1(SCRATCH_READ)
> > > > +   vec4_instruction *DF_IVB_SCRATCH_READ(const dst_reg &dst,
> > > > const
> > > > src_reg &src0,
> > > > +                                         bool low);
> > > > +
> > > >     EMIT2(SCRATCH_WRITE)
> > > >     EMIT3(LRP)
> > > >     EMIT1(BFREV)
> > > > @@ -294,6 +297,11 @@ public:
> > > >  			  dst_reg dst,
> > > >  			  src_reg orig_src,
> > > >  			  int base_offset);
> > > > +   void emit_1grf_df_ivb_scratch_read(bblock_t *block,
> > > > +                                      vec4_instruction *inst,
> > > > +                                      dst_reg temp, src_reg
> > > > orig_src,
> > > > +                                      int base_offset, bool
> > > > first_grf);
> > > > +
> > > >     void emit_scratch_write(bblock_t *block, vec4_instruction
> > > > *inst,
> > > >  			   int base_offset);
> > > >     void emit_pull_constant_load(bblock_t *block,
> > > > vec4_instruction
> > > > *inst,
> > > > diff --git a/src/intel/compiler/brw_vec4_generator.cpp
> > > > b/src/intel/compiler/brw_vec4_generator.cpp
> > > > index 334933d15a..3bb931385a 100644
> > > > --- a/src/intel/compiler/brw_vec4_generator.cpp
> > > > +++ b/src/intel/compiler/brw_vec4_generator.cpp
> > > > @@ -1133,6 +1133,73 @@ generate_unpack_flags(struct brw_codegen
> > > > *p,
> > > >  }
> > > >  
> > > >  static void
> > > > +generate_scratch_read_1oword(struct brw_codegen *p,
> > > > +                             vec4_instruction *inst,
> > > > +                             struct brw_reg dst,
> > > > +                             struct brw_reg index,
> > > > +                             bool low)
> > > > +{
> > > > +   const struct gen_device_info *devinfo = p->devinfo;
> > > > +
> > > > +   assert(devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > +          type_sz(dst.type) == 8);
> > > > +   brw_set_default_access_mode(p, BRW_ALIGN_1);
> > > > +   brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > +
> > > > +   if (!low) {
> > > > +      /* Read second GRF (offset in OWORDs) */
> > > > +      for (int i = 0; i < 2; i++) {
> > > > +         brw_oword_block_read_scratch(p,
> > > > +                                      dst,
> > > > +                                      brw_message_reg(inst-
> > > > > base_mrf),
> > > > 
> > > > +                                      1, 32*inst->offset +
> > > > 16*i +
> > > > 32, false, true);
> > > > +         if (i == 0) {
> > > > +            /* The scratch read message writes the 128 MSB
> > > > (OWORD1
> > > > HIGH) of
> > > > +             * the destination. We need to move them to dst.0
> > > > so
> > > > we can
> > > > +             * read the pending 128 bits without using a
> > > > temporary
> > > > register.
> > > > +             */
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > +            struct brw_reg tmp =
> > > > +               stride(suboffset(dst, 16 / type_sz(dst.type)),
> > > > +                      4, 4, 1);
> > > > +
> > > > +            brw_set_default_mask_control(p, true);
> > > > +            brw_MOV(p, dst, tmp);
> > > > +            brw_set_default_mask_control(p, inst-
> > > > > force_writemask_all);
> > > > 
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > +         }
> > > > +      }
> > > > +   } else {
> > > > +      /* Read first GRF (offset in OWORDs) */
> > > > +      for (int i = 1; i >= 0; i--) {
> > > > +         brw_oword_block_read_scratch(p,
> > > > +                                      dst,
> > > > +                                      brw_message_reg(inst-
> > > > > base_mrf),
> > > > 
> > > > +                                      1, 32*inst->offset +
> > > > 16*i,
> > > > true, false);
> > > > +
> > > > +         if (i == 1) {
> > > > +            /* The scratch read message writes the 128 LSB
> > > > (OWORD1
> > > > LOW) of
> > > > +             * the destination. We need to move them to dst.4
> > > > so
> > > > we can
> > > > +             * read the pending 128 bits without using a
> > > > temporary
> > > > register.
> > > > +             */
> > > > +            struct brw_reg tmp = stride(dst, 4, 4, 1);
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > +            brw_set_default_mask_control(p, true);
> > > > +            brw_MOV(p,
> > > > +                    suboffset(dst, 16 / type_sz(dst.type)),
> > > > +                    tmp);
> > > > +            brw_set_default_mask_control(p, inst-
> > > > > force_writemask_all);
> > > > 
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > +         }
> > > > +      }
> > > > +   }
> > > > +
> > > > +   brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
> > > > +   brw_set_default_access_mode(p, BRW_ALIGN_16);
> > > > +   return;
> > > > +}
> > > > +
> > > > +static void
> > > >  generate_scratch_read(struct brw_codegen *p,
> > > >                        vec4_instruction *inst,
> > > >                        struct brw_reg dst,
> > > > @@ -1143,6 +1210,16 @@ generate_scratch_read(struct brw_codegen
> > > > *p,
> > > >  
> > > >     gen6_resolve_implied_move(p, &header, inst->base_mrf);
> > > >  
> > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > +       type_sz(dst.type) == 8) {
> > > > +      /* First read second GRF (offset in OWORDs) */
> > > > +      struct brw_reg dst_high = suboffset(dst, 32 /
> > > > type_sz(dst.type));
> > > > +      generate_scratch_read_1oword(p, inst, dst_high, index,
> > > > false);
> > > > +      /* Now read first GRF (data from first vertex) */
> > > > +      generate_scratch_read_1oword(p, inst, dst, index, true);
> > > > +      return;
> > > > +   }
> > > > +
> > > >     generate_oword_dual_block_offsets(p, brw_message_reg(inst-
> > > > > base_mrf + 1),
> > > > 
> > > >  				     index);
> > > >  
> > > > @@ -1192,6 +1269,57 @@ generate_scratch_write(struct
> > > > brw_codegen
> > > > *p,
> > > >     struct brw_reg header = brw_vec8_grf(0, 0);
> > > >     bool write_commit;
> > > >  
> > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > +       type_sz(src.type) == 8) {
> > > > +      brw_set_default_access_mode(p, BRW_ALIGN_1);
> > > > +
> > > > +      /* The messages only works with group == 0, we use the
> > > > group
> > > > to know which
> > > > +       * message emit (1-OWORD LOW or 1-OWORD HIGH).
> > > > +       */
> > > > +      brw_set_default_group(p, 0);
> > > > +
> > > > +      if (inst->group == 0) {
> > > > +         for (int i = 0; i < 2; i++) {
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > +            brw_set_default_mask_control(p, true);
> > > > +            struct brw_reg temp =
> > > > +               retype(suboffset(src, i * 16 /
> > > > type_sz(src.type)),
> > > > BRW_REGISTER_TYPE_UD);
> > > > +            temp = stride(temp, 4, 4, 1);
> > > > +
> > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 0),
> > > > +                    temp);
> > > > +            brw_set_default_mask_control(p, inst-
> > > > > force_writemask_all);
> > > > 
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > +
> > > > +            /* Offset in OWORDs */
> > > > +            brw_oword_block_write_scratch(p,
> > > > brw_message_reg(inst-
> > > > > base_mrf),
> > > > 
> > > > +                                          1, 32*inst->offset +
> > > > 16*i, true, false);
> > > > +         }
> > > > +      } else {
> > > > +         for (int i = 0; i < 2; i++) {
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > +
> > > > +            brw_set_default_mask_control(p, true);
> > > > +            struct brw_reg temp =
> > > > +               retype(suboffset(src, i * 16 /
> > > > type_sz(src.type)),
> > > > BRW_REGISTER_TYPE_UD);
> > > > +            temp = stride(temp, 4, 4, 1);
> > > > +
> > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 4),
> > > > +                    temp);
> > > > +
> > > > +            brw_set_default_mask_control(p, inst-
> > > > > force_writemask_all);
> > > > 
> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > +
> > > > +            /* Offset in OWORDs */
> > > > +            brw_oword_block_write_scratch(p,
> > > > brw_message_reg(inst-
> > > > > base_mrf),
> > > > 
> > > > +                                          1, 32*inst->offset +
> > > > 16*i + 32, false, true);
> > > > +         }
> > > > +      }
> > > > +      brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
> > > > +      brw_set_default_access_mode(p, BRW_ALIGN_16);
> > > > +      return;
> > > > +   }
> > > > +
> > > >     /* If the instruction is predicated, we'll predicate the
> > > > send,
> > > > not
> > > >      * the header setup.
> > > >      */
> > > > @@ -1780,6 +1908,14 @@ generate_code(struct brw_codegen *p,
> > > >           generate_vs_urb_write(p, inst);
> > > >           break;
> > > >  
> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +         generate_scratch_read_1oword(p, inst, dst, src[0],
> > > > true);
> > > > +         fill_count++;
> > > > +         break;
> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > +         generate_scratch_read_1oword(p, inst, dst, src[0],
> > > > false);
> > > > +         fill_count++;
> > > > +         break;
> > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >           generate_scratch_read(p, inst, dst, src[0]);
> > > >           fill_count++;
> > > > diff --git a/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > b/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > index a0ba77b867..ec5ba10e86 100644
> > > > --- a/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > +++ b/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > @@ -332,7 +332,9 @@ can_use_scratch_for_source(const
> > > > vec4_instruction *inst, unsigned i,
> > > >         * reusing scratch_reg for this instruction.
> > > >         */
> > > >        if (prev_inst->opcode ==
> > > > SHADER_OPCODE_GEN4_SCRATCH_WRITE ||
> > > > -          prev_inst->opcode ==
> > > > SHADER_OPCODE_GEN4_SCRATCH_READ)
> > > > +          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ
> > > > ||
> > > > +          prev_inst->opcode ==
> > > > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW ||
> > > > +          prev_inst->opcode ==
> > > > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH)
> > > >           continue;
> > > >  
> > > >        /* If the previous instruction does not write to
> > > > scratch_reg, then check
> > > > @@ -467,6 +469,8 @@ vec4_visitor::evaluate_spill_costs(float
> > > > *spill_costs, bool *no_spill)
> > > >           loop_scale /= 10;
> > > >           break;
> > > >  
> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > >        case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > >           for (int i = 0; i < 3; i++) {
> > > > diff --git a/src/intel/compiler/brw_vec4_visitor.cpp
> > > > b/src/intel/compiler/brw_vec4_visitor.cpp
> > > > index 22ee4dd1c4..37ae31c0d5 100644
> > > > --- a/src/intel/compiler/brw_vec4_visitor.cpp
> > > > +++ b/src/intel/compiler/brw_vec4_visitor.cpp
> > > > @@ -264,6 +264,24 @@ vec4_visitor::SCRATCH_READ(const dst_reg
> > > > &dst,
> > > > const src_reg &index)
> > > >  }
> > > >  
> > > >  vec4_instruction *
> > > > +vec4_visitor::DF_IVB_SCRATCH_READ(const dst_reg &dst,
> > > > +                                  const src_reg &index,
> > > > +                                  bool first_grf)
> > > > +{
> > > > +   vec4_instruction *inst;
> > > > +   enum opcode op = first_grf ?
> > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW :
> > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH;
> > > > +
> > > > +   inst = new(mem_ctx) vec4_instruction(op,
> > > > +                                        dst, index);
> > > > +   inst->base_mrf = FIRST_SPILL_MRF(devinfo->gen) + 1;
> > > > +   inst->mlen = 1;
> > > > +
> > > > +   return inst;
> > > > +}
> > > > +
> > > > +vec4_instruction *
> > > >  vec4_visitor::SCRATCH_WRITE(const dst_reg &dst, const src_reg
> > > > &src,
> > > >                              const src_reg &index)
> > > >  {
> > > > @@ -1472,6 +1490,37 @@
> > > > vec4_visitor::get_scratch_offset(bblock_t
> > > > *block, vec4_instruction *inst,
> > > >  
> > > >  /**
> > > >   * Emits an instruction before @inst to load the value named
> > > > by
> > > > @orig_src
> > > > + * from scratch space at @base_offset to @temp. This
> > > > instruction
> > > > only reads
> > > > + * DF value on IVB, one GRF each time.
> > > > + *
> > > > + * @base_offset is measured in 32-byte units (the size of a
> > > > register).
> > > > + * @first_grf indicates if we want to read first vertex data
> > > > (true) or
> > > > + * the second (false).
> > > > + */
> > > > +void
> > > > +vec4_visitor::emit_1grf_df_ivb_scratch_read(bblock_t *block,
> > > > +                                            vec4_instruction
> > > > *inst,
> > > > +                                            dst_reg temp,
> > > > src_reg
> > > > orig_src,
> > > > +                                            int base_offset,
> > > > bool
> > > > first_grf)
> > > > +{
> > > > +   assert(orig_src.offset % REG_SIZE == 0);
> > > > +   src_reg index = get_scratch_offset(block, inst, 0,
> > > > base_offset);
> > > > +
> > > > +   assert(devinfo->gen == 7 && !devinfo->is_haswell &&
> > > > type_sz(temp.type) == 8);
> > > > +   temp.offset = 0;
> > > > +   vec4_instruction *read = DF_IVB_SCRATCH_READ(temp, index,
> > > > first_grf);
> > > > +   read->exec_size = 4;
> > > > +   /* The instruction will use group 0 but a different message
> > > > depending of the
> > > > +    * vertex data to load.
> > > > +    */
> > > > +   read->group = 0;
> > > > +   read->offset = base_offset;
> > > > +   read->size_written = 1;
> > > > +   emit_before(block, inst, read);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Emits an instruction before @inst to load the value named
> > > > by
> > > > @orig_src
> > > >   * from scratch space at @base_offset to @temp.
> > > >   *
> > > >   * @base_offset is measured in 32-byte units (the size of a
> > > > register).
> > > > -- 
> > > > 2.11.0