[Mesa-dev] [PATCH 2/6] i965/vec4/generator: use 1-Oword Block Read/Write messages for DF scratch writes/reads

Tue Jun 27 06:07:54 UTC 2017

On Mon, 2017-06-26 at 10:38 -0700, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
> > On Fri, 2017-06-23 at 11:06 -0700, Francisco Jerez wrote:
> > > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> > > 
> > > > On Thu, 2017-06-22 at 16:25 -0700, Francisco Jerez wrote:
> > > > > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> > > > > 
> > > > > > Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
> > > > > > ---
> > > > > >  src/intel/compiler/brw_eu_defines.h          |   2 +
> > > > > >  src/intel/compiler/brw_shader.cpp            |   5 +
> > > > > >  src/intel/compiler/brw_vec4.cpp              |   7 ++
> > > > > >  src/intel/compiler/brw_vec4.h                |   8 ++
> > > > > >  src/intel/compiler/brw_vec4_generator.cpp    | 136 +++++++++++++++++++++++++++
> > > > > >  src/intel/compiler/brw_vec4_reg_allocate.cpp |   6 +-
> > > > > >  src/intel/compiler/brw_vec4_visitor.cpp      |  49 ++++++++++
> > > > > >  7 files changed, 212 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/src/intel/compiler/brw_eu_defines.h b/src/intel/compiler/brw_eu_defines.h
> > > > > > index 1af835d47e..3c148de0fa 100644
> > > > > > --- a/src/intel/compiler/brw_eu_defines.h
> > > > > > +++ b/src/intel/compiler/brw_eu_defines.h
> > > > > > @@ -436,6 +436,8 @@ enum opcode {
> > > > > >     VEC4_OPCODE_PICK_HIGH_32BIT,
> > > > > >     VEC4_OPCODE_SET_LOW_32BIT,
> > > > > >     VEC4_OPCODE_SET_HIGH_32BIT,
> > > > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW,
> > > > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH,
> > > > > >  
> > > > > 
> > > > > What's the point of introducing two different opcodes with
> > > > > essentially the same semantics (read 32B worth of data) as the
> > > > > current SHADER_OPCODE_GEN4_SCRATCH_READ?
> > > > 
> > > > Originally I had only SHADER_OPCODE_GEN4_SCRATCH_READ, but I changed
> > > > it to avoid allocating more registers than needed when doing the
> > > > scratch write of a partial DF write. Let me explain:
> > > > 
> > > > When spilling, as DF instructions are both split and scalarized, we
> > > > read the existing contents in scratch memory, overwrite them with
> > > > the destination of the instruction, then emit the scratch write.
> > > > Together with the fact that I am not shuffling DF data, we only need
> > > > to allocate 1 GRF to do so, instead of 2 (if I had emitted
> > > > SHADER_OPCODE_GEN4_SCRATCH_READ), when spilling partial DF writes.
> > > > 
> > > 
> > > Why would you need to allocate more GRFs for
> > > SHADER_OPCODE_GEN4_SCRATCH_READ?  It also only reads one register,
> > > which should be sufficient for a single scalarized instruction as
> > > long as you don't shuffle data around -- Have a look at how the FS
> > > back-end addresses this problem.
> > > 
> > 
> > OK
> > 
> > > > >   Is there any downside from using the current opcode with
> > > > > force_writemask_all?  If anything it would give you better
> > > > > performance because you'd only have to set up one header (which
> > > > > stalls the EU pipeline twice), send down one message to the
> > > > > dataport, and avoid stalling to shuffle the data around in the
> > > > > return payload (which prevents your two 1OWORD messages from being
> > > > > pipelined at all).
> > > > > 
> > > > 
> > > > Sorry, I am confused here. Do you mean using
> > > > SHADER_OPCODE_GEN4_SCRATCH_READ as-is, which emits an "OWord Dual
> > > > Block Read" message (so only one message)?
> > > > 
> > > > If that's the case, then I should shuffle the destination data of
> > > > the partial DF write, change the 1-Oword block write offsets and so
> > > > on...
> > > 
> > > Why would you need to shuffle any spilled data?  I don't think
> > > there's much of a benefit from shuffling since scratch overwrites
> > > need to read the original data for the most part anyway because of
> > > writemasking.  In fact shuffling DF data is probably the reason
> > > things blow up right now whenever you have mixed DF and
> > > single-precision reads or writes to the same spilled variable, which
> > > I guess is the reason you need to look for those cases and mark them
> > > as no_spill...
> > > 
> > 
> > Right, I don't need to shuffle data for the scratch write.
> > 
> > > > in order to save it inside scratch memory in the proper place to
> > > > make OWord Dual Block Read work. That would require some extra
> > > > instructions, but I don't know whether this would perform better
> > > > than the current implementation.
> > > > 
> > > 
> > > I expect the most serious performance issue with the approach of
> > > this patch will be the sequence of non-pipelined single-oword reads,
> > > which means you get to pay for the EU-dataport roundtrip latency
> > > twice instead of once.
> > > 
> > > > Then, why do I need force_writemask_all=true when emitting
> > > > SHADER_OPCODE_GEN4_SCRATCH_READ?
> > > > 
> > > 
> > > Because you probably don't want to shuffle data in your scratch
> > > buffer, and you don't want the dataport to apply bogus 16B channel
> > > enables to your reads and writes.
> > > 
> > 
> > If we save the dvec4 data of a vertex all together in 32 consecutive
> > bytes in scratch memory (i.e. no need for shuffling, and we use
> > force_writemask_all as you said), then we need to create a special
> > case for IVB and for partial DF reads on HSW+ when unspilling the data.
> > 
> > What I am thinking now is that, if the scratch write is done wisely,
> > we can write the data in the proper places for the two
> > SHADER_OPCODE_GEN4_SCRATCH_READ instructions we use to unspill DF
> > data: write the XY components with their respective 1-OWord scratch
> > write messages and the ZW components with other 1-OWord scratch write
> > messages at an offset of 32 bytes. Thanks to this, we don't need to
> > touch the current code for unspilling (which does data shuffling), and
> > it allows us to do unspilling on IVB and partial DF reads on HSW+
> > without any special case.
> > 
> > If we choose the no-shuffling-at-all solution, this is an improvement
> > over what I sent in this v1, but I am leaning toward the solution in
> > the last paragraph because it re-uses existing code and simplifies the
> > changes, although we have some data shuffling overhead.
> > 
> > What do you think?
> > 
> 
> Cannot we just drop the shuffling on HSW+ too?  AFAIA it has the same
> drawbacks on HSW+ as it has on IVB, so I don't see any reason for
> supporting both codepaths.
> 

OK!

Sam
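
For readers following the thread, the offset arithmetic used by the 1-OWord
messages in the v1 patch quoted below can be sketched roughly as follows.
This is only an illustration, not Mesa code: one_oword_offsets is a
hypothetical helper, and it merely reproduces the expressions
32*inst->offset + 16*i (LOW) and 32*inst->offset + 16*i + 32 (HIGH) that the
patch passes as the offset argument, for i = 0 and i = 1.

```cpp
#include <array>
#include <cassert>
#include <limits>

// Hypothetical helper (not part of Mesa): returns the two offset values the
// v1 patch passes to its pair of 1-OWord block messages for one GRF worth of
// DF data. reg_offset stands in for inst->offset; low selects the first GRF
// (LOW) or the second one (HIGH), which the patch places 32 further on.
std::array<unsigned, 2> one_oword_offsets(unsigned reg_offset, bool low)
{
   // Guard against unsigned wrap-around in the arithmetic below.
   assert(reg_offset <= (std::numeric_limits<unsigned>::max() - 48) / 32);

   const unsigned base = 32 * reg_offset + (low ? 0 : 32);
   // i = 0 and i = 1 in the patch's loops: consecutive 16-byte OWords.
   return {{base, base + 16}};
}
```

With reg_offset == 0 this gives 0 and 16 for the LOW case and 32 and 48 for
the HIGH case, i.e. four messages where a single OWord Dual Block message is
used by the existing SHADER_OPCODE_GEN4_SCRATCH_READ path.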

> > Sam
> > 
> > 
> > > > I can try this alternative solution if this is what you meant. It
> > > > has the advantage of simplifying the changes a lot, which is always
> > > > great.
> > > > 
> > > > Sam
> > > > 
> > > > > >     FS_OPCODE_DDX_COARSE,
> > > > > >     FS_OPCODE_DDX_FINE,
> > > > > > diff --git a/src/intel/compiler/brw_shader.cpp b/src/intel/compiler/brw_shader.cpp
> > > > > > index 53d0742d2e..248feacbd2 100644
> > > > > > --- a/src/intel/compiler/brw_shader.cpp
> > > > > > +++ b/src/intel/compiler/brw_shader.cpp
> > > > > > @@ -296,6 +296,11 @@ brw_instruction_name(const struct gen_device_info *devinfo, enum opcode op)
> > > > > >     case FS_OPCODE_PACK:
> > > > > >        return "pack";
> > > > > >  
> > > > > > +
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +      return "gen4_scratch_read_1oword_low";
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > > +      return "gen4_scratch_read_1oword_high";
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >        return "gen4_scratch_read";
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > > > diff --git a/src/intel/compiler/brw_vec4.cpp b/src/intel/compiler/brw_vec4.cpp
> > > > > > index b443effca9..b6d409eea2 100644
> > > > > > --- a/src/intel/compiler/brw_vec4.cpp
> > > > > > +++ b/src/intel/compiler/brw_vec4.cpp
> > > > > > @@ -259,6 +259,8 @@ bool
> > > > > >  vec4_instruction::can_do_writemask(const struct gen_device_info *devinfo)
> > > > > >  {
> > > > > >     switch (opcode) {
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >     case VEC4_OPCODE_DOUBLE_TO_F32:
> > > > > >     case VEC4_OPCODE_DOUBLE_TO_D32:
> > > > > > @@ -335,6 +337,9 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
> > > > > >        return 1;
> > > > > >     case VS_OPCODE_PULL_CONSTANT_LOAD:
> > > > > >        return 2;
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > > +      return 1;
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >        return 2;
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > > > @@ -2091,6 +2096,8 @@ get_lowered_simd_width(const struct gen_device_info *devinfo,
> > > > > >  {
> > > > > >     /* Do not split some instructions that require special handling */
> > > > > >     switch (inst->opcode) {
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > > >        return inst->exec_size;
> > > > > > diff --git a/src/intel/compiler/brw_vec4.h b/src/intel/compiler/brw_vec4.h
> > > > > > index d828da02ea..a5b45aca21 100644
> > > > > > --- a/src/intel/compiler/brw_vec4.h
> > > > > > +++ b/src/intel/compiler/brw_vec4.h
> > > > > > @@ -214,6 +214,9 @@ public:
> > > > > >                          enum brw_conditional_mod condition);
> > > > > >     vec4_instruction *IF(enum brw_predicate predicate);
> > > > > >     EMIT1(SCRATCH_READ)
> > > > > > +   vec4_instruction *DF_IVB_SCRATCH_READ(const dst_reg &dst, const src_reg &src0,
> > > > > > +                                         bool low);
> > > > > > +
> > > > > >     EMIT2(SCRATCH_WRITE)
> > > > > >     EMIT3(LRP)
> > > > > >     EMIT1(BFREV)
> > > > > > @@ -294,6 +297,11 @@ public:
> > > > > >  			  dst_reg dst,
> > > > > >  			  src_reg orig_src,
> > > > > >  			  int base_offset);
> > > > > > +   void emit_1grf_df_ivb_scratch_read(bblock_t *block,
> > > > > > +                                      vec4_instruction *inst,
> > > > > > +                                      dst_reg temp, src_reg orig_src,
> > > > > > +                                      int base_offset, bool first_grf);
> > > > > > +
> > > > > >     void emit_scratch_write(bblock_t *block, vec4_instruction *inst,
> > > > > >  			   int base_offset);
> > > > > >     void emit_pull_constant_load(bblock_t *block, vec4_instruction *inst,
> > > > > > diff --git a/src/intel/compiler/brw_vec4_generator.cpp b/src/intel/compiler/brw_vec4_generator.cpp
> > > > > > index 334933d15a..3bb931385a 100644
> > > > > > --- a/src/intel/compiler/brw_vec4_generator.cpp
> > > > > > +++ b/src/intel/compiler/brw_vec4_generator.cpp
> > > > > > @@ -1133,6 +1133,73 @@ generate_unpack_flags(struct brw_codegen *p,
> > > > > >  }
> > > > > >  
> > > > > >  static void
> > > > > > +generate_scratch_read_1oword(struct brw_codegen *p,
> > > > > > +                             vec4_instruction *inst,
> > > > > > +                             struct brw_reg dst,
> > > > > > +                             struct brw_reg index,
> > > > > > +                             bool low)
> > > > > > +{
> > > > > > +   const struct gen_device_info *devinfo = p->devinfo;
> > > > > > +
> > > > > > +   assert(devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > > > +          type_sz(dst.type) == 8);
> > > > > > +   brw_set_default_access_mode(p, BRW_ALIGN_1);
> > > > > > +   brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > > > +
> > > > > > +   if (!low) {
> > > > > > +      /* Read second GRF (offset in OWORDs) */
> > > > > > +      for (int i = 0; i < 2; i++) {
> > > > > > +         brw_oword_block_read_scratch(p,
> > > > > > +                                      dst,
> > > > > > +                                      brw_message_reg(inst->base_mrf),
> > > > > > +                                      1, 32*inst->offset + 16*i + 32, false, true);
> > > > > > +         if (i == 0) {
> > > > > > +            /* The scratch read message writes the 128 MSB (OWORD1 HIGH) of
> > > > > > +             * the destination. We need to move them to dst.0 so we can
> > > > > > +             * read the pending 128 bits without using a temporary register.
> > > > > > +             */
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > > > +            struct brw_reg tmp =
> > > > > > +               stride(suboffset(dst, 16 / type_sz(dst.type)),
> > > > > > +                      4, 4, 1);
> > > > > > +
> > > > > > +            brw_set_default_mask_control(p, true);
> > > > > > +            brw_MOV(p, dst, tmp);
> > > > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > > > +         }
> > > > > > +      }
> > > > > > +   } else {
> > > > > > +      /* Read first GRF (offset in OWORDs) */
> > > > > > +      for (int i = 1; i >= 0; i--) {
> > > > > > +         brw_oword_block_read_scratch(p,
> > > > > > +                                      dst,
> > > > > > +                                      brw_message_reg(inst->base_mrf),
> > > > > > +                                      1, 32*inst->offset + 16*i, true, false);
> > > > > > +
> > > > > > +         if (i == 1) {
> > > > > > +            /* The scratch read message writes the 128 LSB (OWORD1 LOW) of
> > > > > > +             * the destination. We need to move them to dst.4 so we can
> > > > > > +             * read the pending 128 bits without using a temporary register.
> > > > > > +             */
> > > > > > +            struct brw_reg tmp = stride(dst, 4, 4, 1);
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > > > +            brw_set_default_mask_control(p, true);
> > > > > > +            brw_MOV(p,
> > > > > > +                    suboffset(dst, 16 / type_sz(dst.type)),
> > > > > > +                    tmp);
> > > > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > > > +         }
> > > > > > +      }
> > > > > > +   }
> > > > > > +
> > > > > > +   brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
> > > > > > +   brw_set_default_access_mode(p, BRW_ALIGN_16);
> > > > > > +   return;
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > >  generate_scratch_read(struct brw_codegen *p,
> > > > > >                        vec4_instruction *inst,
> > > > > >                        struct brw_reg dst,
> > > > > > @@ -1143,6 +1210,16 @@ generate_scratch_read(struct brw_codegen *p,
> > > > > >  
> > > > > >     gen6_resolve_implied_move(p, &header, inst->base_mrf);
> > > > > >  
> > > > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > > > +       type_sz(dst.type) == 8) {
> > > > > > +      /* First read second GRF (offset in OWORDs) */
> > > > > > +      struct brw_reg dst_high = suboffset(dst, 32 / type_sz(dst.type));
> > > > > > +      generate_scratch_read_1oword(p, inst, dst_high, index, false);
> > > > > > +      /* Now read first GRF (data from first vertex) */
> > > > > > +      generate_scratch_read_1oword(p, inst, dst, index, true);
> > > > > > +      return;
> > > > > > +   }
> > > > > > +
> > > > > >     generate_oword_dual_block_offsets(p, brw_message_reg(inst->base_mrf + 1),
> > > > > >  				     index);
> > > > > >  
> > > > > > @@ -1192,6 +1269,57 @@ generate_scratch_write(struct brw_codegen *p,
> > > > > >     struct brw_reg header = brw_vec8_grf(0, 0);
> > > > > >     bool write_commit;
> > > > > >  
> > > > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
> > > > > > +       type_sz(src.type) == 8) {
> > > > > > +      brw_set_default_access_mode(p, BRW_ALIGN_1);
> > > > > > +
> > > > > > +      /* The messages only work with group == 0; we use the group to
> > > > > > +       * know which message to emit (1-OWORD LOW or 1-OWORD HIGH).
> > > > > > +       */
> > > > > > +      brw_set_default_group(p, 0);
> > > > > > +
> > > > > > +      if (inst->group == 0) {
> > > > > > +         for (int i = 0; i < 2; i++) {
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > > > +            brw_set_default_mask_control(p, true);
> > > > > > +            struct brw_reg temp =
> > > > > > +               retype(suboffset(src, i * 16 / type_sz(src.type)), BRW_REGISTER_TYPE_UD);
> > > > > > +            temp = stride(temp, 4, 4, 1);
> > > > > > +
> > > > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 0),
> > > > > > +                    temp);
> > > > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > > > +
> > > > > > +            /* Offset in OWORDs */
> > > > > > +            brw_oword_block_write_scratch(p, brw_message_reg(inst->base_mrf),
> > > > > > +                                          1, 32*inst->offset + 16*i, true, false);
> > > > > > +         }
> > > > > > +      } else {
> > > > > > +         for (int i = 0; i < 2; i++) {
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
> > > > > > +
> > > > > > +            brw_set_default_mask_control(p, true);
> > > > > > +            struct brw_reg temp =
> > > > > > +               retype(suboffset(src, i * 16 / type_sz(src.type)), BRW_REGISTER_TYPE_UD);
> > > > > > +            temp = stride(temp, 4, 4, 1);
> > > > > > +
> > > > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 4),
> > > > > > +                    temp);
> > > > > > +
> > > > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
> > > > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
> > > > > > +
> > > > > > +            /* Offset in OWORDs */
> > > > > > +            brw_oword_block_write_scratch(p, brw_message_reg(inst->base_mrf),
> > > > > > +                                          1, 32*inst->offset + 16*i + 32, false, true);
> > > > > > +         }
> > > > > > +      }
> > > > > > +      brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
> > > > > > +      brw_set_default_access_mode(p, BRW_ALIGN_16);
> > > > > > +      return;
> > > > > > +   }
> > > > > > +
> > > > > >     /* If the instruction is predicated, we'll predicate the send, not
> > > > > >      * the header setup.
> > > > > >      */
> > > > > > @@ -1780,6 +1908,14 @@ generate_code(struct brw_codegen *p,
> > > > > >           generate_vs_urb_write(p, inst);
> > > > > >           break;
> > > > > >  
> > > > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +         generate_scratch_read_1oword(p, inst, dst, src[0], true);
> > > > > > +         fill_count++;
> > > > > > +         break;
> > > > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > > +         generate_scratch_read_1oword(p, inst, dst, src[0], false);
> > > > > > +         fill_count++;
> > > > > > +         break;
> > > > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >           generate_scratch_read(p, inst, dst, src[0]);
> > > > > >           fill_count++;
> > > > > > diff --git a/src/intel/compiler/brw_vec4_reg_allocate.cpp b/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > > > index a0ba77b867..ec5ba10e86 100644
> > > > > > --- a/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > > > +++ b/src/intel/compiler/brw_vec4_reg_allocate.cpp
> > > > > > @@ -332,7 +332,9 @@ can_use_scratch_for_source(const vec4_instruction *inst, unsigned i,
> > > > > >         * reusing scratch_reg for this instruction.
> > > > > >         */
> > > > > >        if (prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE ||
> > > > > > -          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ)
> > > > > > +          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
> > > > > > +          prev_inst->opcode == VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW ||
> > > > > > +          prev_inst->opcode == VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH)
> > > > > >           continue;
> > > > > >  
> > > > > >        /* If the previous instruction does not write to scratch_reg, then check
> > > > > > @@ -467,6 +469,8 @@ vec4_visitor::evaluate_spill_costs(float *spill_costs, bool *no_spill)
> > > > > >           loop_scale /= 10;
> > > > > >           break;
> > > > > >  
> > > > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
> > > > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
> > > > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
> > > > > >        case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
> > > > > >           for (int i = 0; i < 3; i++) {
> > > > > > diff --git a/src/intel/compiler/brw_vec4_visitor.cpp b/src/intel/compiler/brw_vec4_visitor.cpp
> > > > > > index 22ee4dd1c4..37ae31c0d5 100644
> > > > > > --- a/src/intel/compiler/brw_vec4_visitor.cpp
> > > > > > +++ b/src/intel/compiler/brw_vec4_visitor.cpp
> > > > > > @@ -264,6 +264,24 @@ vec4_visitor::SCRATCH_READ(const dst_reg &dst, const src_reg &index)
> > > > > >  }
> > > > > >  
> > > > > >  vec4_instruction *
> > > > > > +vec4_visitor::DF_IVB_SCRATCH_READ(const dst_reg &dst,
> > > > > > +                                  const src_reg &index,
> > > > > > +                                  bool first_grf)
> > > > > > +{
> > > > > > +   vec4_instruction *inst;
> > > > > > +   enum opcode op = first_grf ?
> > > > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW :
> > > > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH;
> > > > > > +
> > > > > > +   inst = new(mem_ctx) vec4_instruction(op,
> > > > > > +                                        dst, index);
> > > > > > +   inst->base_mrf = FIRST_SPILL_MRF(devinfo->gen) + 1;
> > > > > > +   inst->mlen = 1;
> > > > > > +
> > > > > > +   return inst;
> > > > > > +}
> > > > > > +
> > > > > > +vec4_instruction *
> > > > > >  vec4_visitor::SCRATCH_WRITE(const dst_reg &dst, const src_reg &src,
> > > > > >                              const src_reg &index)
> > > > > >  {
> > > > > > @@ -1472,6 +1490,37 @@ vec4_visitor::get_scratch_offset(bblock_t *block, vec4_instruction *inst,
> > > > > >  
> > > > > >  /**
> > > > > >   * Emits an instruction before @inst to load the value named by @orig_src
> > > > > > + * from scratch space at @base_offset to @temp. This instruction only reads
> > > > > > + * DF value on IVB, one GRF each time.
> > > > > > + *
> > > > > > + * @base_offset is measured in 32-byte units (the size of a register).
> > > > > > + * @first_grf indicates if we want to read first vertex data (true) or
> > > > > > + * the second (false).
> > > > > > + */
> > > > > > +void
> > > > > > +vec4_visitor::emit_1grf_df_ivb_scratch_read(bblock_t *block,
> > > > > > +                                            vec4_instruction *inst,
> > > > > > +                                            dst_reg temp, src_reg orig_src,
> > > > > > +                                            int base_offset, bool first_grf)
> > > > > > +{
> > > > > > +   assert(orig_src.offset % REG_SIZE == 0);
> > > > > > +   src_reg index = get_scratch_offset(block, inst, 0, base_offset);
> > > > > > +
> > > > > > +   assert(devinfo->gen == 7 && !devinfo->is_haswell && type_sz(temp.type) == 8);
> > > > > > +   temp.offset = 0;
> > > > > > +   vec4_instruction *read = DF_IVB_SCRATCH_READ(temp, index, first_grf);
> > > > > > +   read->exec_size = 4;
> > > > > > +   /* The instruction will use group 0 but a different message depending on the
> > > > > > +    * vertex data to load.
> > > > > > +    */
> > > > > > +   read->group = 0;
> > > > > > +   read->offset = base_offset;
> > > > > > +   read->size_written = 1;
> > > > > > +   emit_before(block, inst, read);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Emits an instruction before @inst to load the value named by @orig_src
> > > > > >   * from scratch space at @base_offset to @temp.
> > > > > >   *
> > > > > >   * @base_offset is measured in 32-byte units (the size of a register).
> > > > > > -- 
> > > > > > 2.11.0
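
The read-modify-write sequence described earlier in the thread for spilling a
partial DF write (read the existing scratch contents, overwrite them with the
instruction's destination, then emit the scratch write) can be modelled in a
few lines. This is a toy sketch under stated assumptions, not the back-end
code: Grf and spill_partial_write are hypothetical names, scratch memory is
modelled as a vector of 32-byte slots holding four doubles each, and the
writemask is a plain bitmask over the four components.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one 32-byte register: four double-precision values.
using Grf = std::array<double, 4>;

// Toy model of spilling a partially written DF destination without
// shuffling: a single temporary register is enough.
void spill_partial_write(std::vector<Grf> &scratch, std::size_t slot,
                         const Grf &result, unsigned writemask)
{
   assert(slot < scratch.size());

   Grf temp = scratch[slot];          // scratch read of the existing contents
   for (unsigned c = 0; c < 4; c++) {
      if (writemask & (1u << c))      // overwrite only the enabled components
         temp[c] = result[c];
   }
   scratch[slot] = temp;              // scratch write of the merged register
}
```

Calling spill_partial_write(scratch, 1, result, 0x3) replaces only the first
two components of slot 1 and leaves the other two, and every other slot,
untouched, which is why the original data has to be read back before the
write.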