[Mesa-dev] [PATCH v2 3/6] i965/vec4: Don't emit scratch reads for a spilled register we have just written

Thu Aug 6 07:36:03 PDT 2015

Francisco Jerez <currojerez at riseup.net> writes:

> Iago Toral <itoral at igalia.com> writes:
>
>> On Fri, 2015-07-31 at 13:12 +0300, Francisco Jerez wrote:
>>> Iago Toral <itoral at igalia.com> writes:
>>> 
>>> > On Thu, 2015-07-30 at 17:08 +0300, Francisco Jerez wrote:
>>> >> Iago Toral Quiroga <itoral at igalia.com> writes:
>>> >> 
>>> >> > When we have code such as this:
>>> >> >
>>> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>>> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>>> >> > ...
>>> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>>> >> >
>>> >> > And vgrf1 is chosen for spilling, we can emit this:
>>> >> >
>>> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>>> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
>>> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>>> >> > ...
>>> >> > gen4_scratch_read vgrf4.0.x:F, 22D
>>> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
>>> >> >
>>> >> > Instead of this:
>>> >> >
>>> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>>> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
>>> >> > gen4_scratch_read vgrf4.0.x:F, 22D
>>> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
>>> >> > ...
>>> >> > gen4_scratch_read vgrf5.0.x:F, 22D
>>> >> > mov vgrf3.0.x:F, vgrf5.xxxx:F
>>> >> >
>>> >> > And save one scratch read while still preserving the benefits of
>>> >> > spilling the register.
>>> >> >
>>> >> > In general, we avoid emitting scratch reads for as long as the next instruction
>>> >> > keeps reading the spilled register. This should not harm the benefit of
>>> >> > spilling the register because gains for register allocation only come when we
>>> >> > have chunks of program code where the register is alive but not really used
>>> >> > (because these are the points where we could effectively use that register for
>>> >> > another purpose if we spilled it), so as long as consecutive instructions use
>>> >> > that register we can avoid the scratch reads without losing anything.
>>> >> > ---
>>> >> >  .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     | 37 +++++++++++++++++++++-
>>> >> >  1 file changed, 36 insertions(+), 1 deletion(-)
>>> >> >
>>> >> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> >> > index cff5406..fd56dae 100644
>>> >> > --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> >> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>>> >> > @@ -340,11 +340,43 @@ vec4_visitor::spill_reg(int spill_reg_nr)
>>> >> >     unsigned int spill_offset = last_scratch++;
>>> >> >  
>>> >> >     /* Generate spill/unspill instructions for the objects being spilled. */
>>> >> > +   vec4_instruction *spill_write_inst = NULL;
>>> >> >     foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
>>> >> > +      /* We don't spill registers used for scratch */
>>> >> > +      if (inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
>>> >> > +          inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE)
>>> >> > +         continue;
>>> >> > +
>>> >> >        int scratch_reg = -1;
>>> >> > +      bool spill_reg_was_read = false;
>>> >> >        for (unsigned int i = 0; i < 3; i++) {
>>> >> >           if (inst->src[i].file == GRF && inst->src[i].reg == spill_reg_nr) {
>>> >> > -            if (scratch_reg == -1) {
>>> >> > +            if (!spill_reg_was_read) {
>>> >> > +               spill_reg_was_read = (!inst->predicate ||
>>> >> > +                                     inst->opcode == BRW_OPCODE_SEL);
>>> >> > +            }
>>> >> > +
>>> >> > +            /* If we are reading the spilled register right after writing
>>> >> > +             * to it we can skip the scratch read and use directly the
>>> >> > +             * register we used as source for the scratch write. For this
>>> >> > +             * to work we must check that:
>>> >> > +             *
>>> >> > +             * 1) The write is inconditional, that is, it is not predicated or
>>> >> > +             *    it is a SEL.
>>> >> > +             * 2) All the channels that we read have been written in that
>>> >> > +             *    last write instruction.
>>> >> > +             *
>>> >> > +             * We keep doing this for as long as the next instruction
>>> >> > +             * keeps reading the spilled register and break as soon as we
>>> >> > +             * find an instruction that doesn't.
>>> >> > +             */
>>> >> > +            if (spill_write_inst &&
>>> >> > +                (!spill_write_inst->predicate ||
>>> >> > +                 spill_write_inst->opcode == BRW_OPCODE_SEL) &&
>>> >> > +                ((brw_mask_for_swizzle(inst->src[i].swizzle) &
>>> >> > +                 ~spill_write_inst->dst.writemask) == 0)) {
>>> >> > +               scratch_reg = spill_write_inst->dst.reg;
>>> >> > +            } else if (scratch_reg == -1) {
>>> >> 
>>> >> One suggestion: You could factor out the rather complex caching logic
>>> >> into a separate function (e.g. 'bool can_reuse_scratch_for_source(const
>>> >> vec4_instruction *, unsigned i, unsigned scratch_reg)').  The function
>>> >> would simply compare scratch_reg with the sources of the current
>>> >> instruction (up to src) and the sources and destination of the previous
>>> >> non-scratch_read/write instruction.  If there's a match it would check
>>> >> that the regioning is compatible with the i-th source and return true in
>>> >> that case.  This would have several benefits:
>>> >
>>> > I think this might need to be a bit more complex. The previous inst's
>>> > src[i] might read only a subset of the channels that where loaded into
>>> > scratch_reg so comparing only against that can lead us to think that we
>>> > can't reuse scratch_reg when in fact we can.
>>> >
>>> > I think the process should be more like we loop back looking at the prev
>>> > instruction for as long as the previous instruction reads the
>>> > scratch_ref until we find the one that writes to scratch_reg (which must
>>> > be a scratch read). At that point we check that the writemask in that
>>> > instruction is compatible with our swizzle on src[i], in which case we
>>> > can reuse scratch_reg. Does this make sense to you?
>>> >
>>> The thing is that there may be no scratch read in the program yet if
>>> you're being called from evaluate_spill_costs().  I think you want to
>>> iterate for as long as the previous instruction reads scratch_reg but
>>> doesn't write it -- If after the end of the loop you end up at an
>>> instruction that writes, fine, you take the writemask from it, otherwise
>>> assume XYZW.
>>> 
>>> Yes, we should probably change spill_reg() to read whole vec4s at once
>>> when unspilling a register, reading a subset of the channels doesn't
>>> save any memory bandwidth and makes the optimization you're implementing
>>> less likely to succeed.
>>> 
>>> > In this case, if we have a successful match, we probably want to keep a
>>> > reference to that instruction so that for the next src of the current
>>> > instruction (or for the next instruction in our program) we don't have
>>> > to repeat that process.
>>> >
>>> 
>>> Keep in mind that the assumption behind this optimization is that a
>>> given spill candidate is only re-used for a very short sequence of
>>> consecutive instructions (otherwise it might in fact be harmful), so the
>>> amount of CPU cycles you can save by doing that is likely to be tiny in
>>> practice -- Feel free to do it if it's trivial but I wouldn't trade it
>>> for any additional complexity in the spill_reg() and
>>> evaluate_spill_costs() loops.
>>> 
>>> One last concern I had about this series has to do with the possibility
>>> of it leading to an infinite loop in the allocate-spill loop -- I guess
>>> that at least in theory it would be possible for a register used *only*
>>> in a consecutive sequence of instructions to score quite high if the
>>> sequence is part of a region of high register pressure.  If the register
>>> allocator decides it's the best candidate we have a problem because this
>>> heuristic will prevent us from making any progress.  I guess the easiest
>>> way around this would be to have an array of counters in
>>> evaluate_spill_costs() keeping track of the number of individual
>>> instructions each register had to be (un)spilled for.  At the end of the
>>> loop you'd check if it's less or equal to one and in that case mark it
>>> as no_spill.  (Note that in theory this situation is possible even
>>> without your changes AFAICT, just far less likely)
>>
>> I think this can't happen. If a register is only used in consecutive
>> instructions and is selected for spilling, we will at least emit one
>> scratch read or write for it (in the first instruction that uses it).
>> The code in evaluate_spill_costs will then mark as no_spill any register
>> used in a scratch read/write, which will make this register
>> automatically not selectable for spilling after the first time it is
>> selected.
>>
> Ah, good point, disregard my comment in that case.
>
In fact it seems really silly that we emit a scratch write for a spilled
variable that is never read again, it might be worth forcing no_spill
for those as I suggested anyway in order to avoid that.  That said I
don't see any reason to change that now because as you say the infinite
loop shouldn't happen in practice, so it can always be done as a
follow-up.

>>> > Iago
>>> >
>>> >
>>> >>  - It would keep ::spill_reg() simple and the caching heuristic factored
>>> >>    out.  The only thing spill_reg() would still need to take care of is
>>> >>    keep track of the last spilling temporary (e.g. as you're doing it
>>> >>    now with the scratch_reg variable) and pass it to
>>> >>    can_reuse_scratch_for_source(), but it would no longer be necessary
>>> >>    to reset it to -1 when reuse is not possible.
>>> >> 
>>> >>  - It would allow you to use a single implementation of the caching
>>> >>    policy to implement both spill_reg() and evaluate_spill_costs(), what
>>> >>    would make sure that things don't go out of sync in the likely case
>>> >>    that we want to change the caching heuristic in the future.
>>> >> 
>>> >> >                 scratch_reg = alloc.allocate(1);
>>> >> >                 src_reg temp = inst->src[i];
>>> >> >                 temp.reg = scratch_reg;
>>> >> > @@ -357,6 +389,9 @@ vec4_visitor::spill_reg(int spill_reg_nr)
>>> >> >  
>>> >> >        if (inst->dst.file == GRF && inst->dst.reg == spill_reg_nr) {
>>> >> >           emit_scratch_write(block, inst, spill_offset);
>>> >> > +         spill_write_inst = inst;
>>> >> > +      } else if (spill_write_inst && !spill_reg_was_read) {
>>> >> > +         spill_write_inst = NULL;
>>> >> >        }
>>> >> >     }
>>> >> >  
>>> >> > -- 
>>> >> > 1.9.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150806/720f62b5/attachment.sig>