[Mesa-dev] [PATCH v2 3/6] i965/vec4: Don't emit scratch reads for a spilled register we have just written

Tue Aug 4 00:01:45 PDT 2015

On Fri, 2015-07-31 at 13:12 +0300, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
> 
> > On Thu, 2015-07-30 at 17:08 +0300, Francisco Jerez wrote:
> >> Iago Toral Quiroga <itoral at igalia.com> writes:
> >> 
> >> > When we have code such as this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> > ...
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> >
> >> > And vgrf1 is chosen for spilling, we can emit this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> > ...
> >> > gen4_scratch_read vgrf4.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
> >> >
> >> > Instead of this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
> >> > gen4_scratch_read vgrf4.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
> >> > ...
> >> > gen4_scratch_read vgrf5.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf5.xxxx:F
> >> >
> >> > And save one scratch read while still preserving the benefits of
> >> > spilling the register.
> >> >
> >> > In general, we avoid emitting scratch reads for as long as the next instruction
> >> > keeps reading the spilled register. This should not harm the benefit of
> >> > spilling the register because gains for register allocation only come when we
> >> > have chunks of program code where the register is alive but not really used
> >> > (because these are the points where we could effectively use that register for
> >> > another purpose if we spilled it), so as long as consecutive instructions use
> >> > that register we can avoid the scratch reads without losing anything.
> >> > ---
> >> >  .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     | 37 +++++++++++++++++++++-
> >> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > index cff5406..fd56dae 100644
> >> > --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > @@ -340,11 +340,43 @@ vec4_visitor::spill_reg(int spill_reg_nr)
> >> >     unsigned int spill_offset = last_scratch++;
> >> >  
> >> >     /* Generate spill/unspill instructions for the objects being spilled. */
> >> > +   vec4_instruction *spill_write_inst = NULL;
> >> >     foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
> >> > +      /* We don't spill registers used for scratch */
> >> > +      if (inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
> >> > +          inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE)
> >> > +         continue;
> >> > +
> >> >        int scratch_reg = -1;
> >> > +      bool spill_reg_was_read = false;
> >> >        for (unsigned int i = 0; i < 3; i++) {
> >> >           if (inst->src[i].file == GRF && inst->src[i].reg == spill_reg_nr) {
> >> > -            if (scratch_reg == -1) {
> >> > +            if (!spill_reg_was_read) {
> >> > +               spill_reg_was_read = (!inst->predicate ||
> >> > +                                     inst->opcode == BRW_OPCODE_SEL);
> >> > +            }
> >> > +
> >> > +            /* If we are reading the spilled register right after writing
> >> > +             * to it we can skip the scratch read and use directly the
> >> > +             * register we used as source for the scratch write. For this
> >> > +             * to work we must check that:
> >> > +             *
> >> > +             * 1) The write is inconditional, that is, it is not predicated or
> >> > +             *    it is a SEL.
> >> > +             * 2) All the channels that we read have been written in that
> >> > +             *    last write instruction.
> >> > +             *
> >> > +             * We keep doing this for as long as the next instruction
> >> > +             * keeps reading the spilled register and break as soon as we
> >> > +             * find an instruction that doesn't.
> >> > +             */
> >> > +            if (spill_write_inst &&
> >> > +                (!spill_write_inst->predicate ||
> >> > +                 spill_write_inst->opcode == BRW_OPCODE_SEL) &&
> >> > +                ((brw_mask_for_swizzle(inst->src[i].swizzle) &
> >> > +                 ~spill_write_inst->dst.writemask) == 0)) {
> >> > +               scratch_reg = spill_write_inst->dst.reg;
> >> > +            } else if (scratch_reg == -1) {
> >> 
> >> One suggestion: You could factor out the rather complex caching logic
> >> into a separate function (e.g. 'bool can_reuse_scratch_for_source(const
> >> vec4_instruction *, unsigned i, unsigned scratch_reg)').  The function
> >> would simply compare scratch_reg with the sources of the current
> >> instruction (up to src) and the sources and destination of the previous
> >> non-scratch_read/write instruction.  If there's a match it would check
> >> that the regioning is compatible with the i-th source and return true in
> >> that case.  This would have several benefits:
> >
> > I think this might need to be a bit more complex. The previous inst's
> > src[i] might read only a subset of the channels that where loaded into
> > scratch_reg so comparing only against that can lead us to think that we
> > can't reuse scratch_reg when in fact we can.
> >
> > I think the process should be more like we loop back looking at the prev
> > instruction for as long as the previous instruction reads the
> > scratch_ref until we find the one that writes to scratch_reg (which must
> > be a scratch read). At that point we check that the writemask in that
> > instruction is compatible with our swizzle on src[i], in which case we
> > can reuse scratch_reg. Does this make sense to you?
> >
> The thing is that there may be no scratch read in the program yet if
> you're being called from evaluate_spill_costs().  I think you want to
> iterate for as long as the previous instruction reads scratch_reg but
> doesn't write it -- If after the end of the loop you end up at an
> instruction that writes, fine, you take the writemask from it, otherwise
> assume XYZW.
> 
> Yes, we should probably change spill_reg() to read whole vec4s at once
> when unspilling a register, reading a subset of the channels doesn't
> save any memory bandwidth and makes the optimization you're implementing
> less likely to succeed.
> 
> > In this case, if we have a successful match, we probably want to keep a
> > reference to that instruction so that for the next src of the current
> > instruction (or for the next instruction in our program) we don't have
> > to repeat that process.
> >
> 
> Keep in mind that the assumption behind this optimization is that a
> given spill candidate is only re-used for a very short sequence of
> consecutive instructions (otherwise it might in fact be harmful), so the
> amount of CPU cycles you can save by doing that is likely to be tiny in
> practice -- Feel free to do it if it's trivial but I wouldn't trade it
> for any additional complexity in the spill_reg() and
> evaluate_spill_costs() loops.
> 
> One last concern I had about this series has to do with the possibility
> of it leading to an infinite loop in the allocate-spill loop -- I guess
> that at least in theory it would be possible for a register used *only*
> in a consecutive sequence of instructions to score quite high if the
> sequence is part of a region of high register pressure.  If the register
> allocator decides it's the best candidate we have a problem because this
> heuristic will prevent us from making any progress.  I guess the easiest
> way around this would be to have an array of counters in
> evaluate_spill_costs() keeping track of the number of individual
> instructions each register had to be (un)spilled for.  At the end of the
> loop you'd check if it's less or equal to one and in that case mark it
> as no_spill.  (Note that in theory this situation is possible even
> without your changes AFAICT, just far less likely)

I think this can't happen. If a register is only used in consecutive
instructions and is selected for spilling, we will at least emit one
scratch read or write for it (in the first instruction that uses it).
The code in evaluate_spill_costs will then mark as no_spill any register
used in a scratch read/write, which will make this register
automatically not selectable for spilling after the first time it is
selected.

> > Iago
> >
> >
> >>  - It would keep ::spill_reg() simple and the caching heuristic factored
> >>    out.  The only thing spill_reg() would still need to take care of is
> >>    keep track of the last spilling temporary (e.g. as you're doing it
> >>    now with the scratch_reg variable) and pass it to
> >>    can_reuse_scratch_for_source(), but it would no longer be necessary
> >>    to reset it to -1 when reuse is not possible.
> >> 
> >>  - It would allow you to use a single implementation of the caching
> >>    policy to implement both spill_reg() and evaluate_spill_costs(), what
> >>    would make sure that things don't go out of sync in the likely case
> >>    that we want to change the caching heuristic in the future.
> >> 
> >> >                 scratch_reg = alloc.allocate(1);
> >> >                 src_reg temp = inst->src[i];
> >> >                 temp.reg = scratch_reg;
> >> > @@ -357,6 +389,9 @@ vec4_visitor::spill_reg(int spill_reg_nr)
> >> >  
> >> >        if (inst->dst.file == GRF && inst->dst.reg == spill_reg_nr) {
> >> >           emit_scratch_write(block, inst, spill_offset);
> >> > +         spill_write_inst = inst;
> >> > +      } else if (spill_write_inst && !spill_reg_was_read) {
> >> > +         spill_write_inst = NULL;
> >> >        }
> >> >     }
> >> >  
> >> > -- 
> >> > 1.9.1