[Mesa-dev] [PATCH v2 3/6] i965/vec4: Don't emit scratch reads for a spilled register we have just written

Francisco Jerez currojerez at riseup.net
Fri Jul 31 03:12:40 PDT 2015


Iago Toral <itoral at igalia.com> writes:

> On Thu, 2015-07-30 at 17:08 +0300, Francisco Jerez wrote:
>> Iago Toral Quiroga <itoral at igalia.com> writes:
>> 
>> > When we have code such as this:
>> >
>> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>> > ...
>> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>> >
>> > And vgrf1 is chosen for spilling, we can emit this:
>> >
>> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
>> > mov vgrf3.0.x:F, vgrf1.xxxx:F
>> > ...
>> > gen4_scratch_read vgrf4.0.x:F, 22D
>> > mov vgrf3.0.x:F, vgrf4.xxxx:F
>> >
>> > Instead of this:
>> >
>> > mov vgrf1.0.x:F, vgrf2.xxxx:F
>> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
>> > gen4_scratch_read vgrf4.0.x:F, 22D
>> > mov vgrf3.0.x:F, vgrf4.xxxx:F
>> > ...
>> > gen4_scratch_read vgrf5.0.x:F, 22D
>> > mov vgrf3.0.x:F, vgrf5.xxxx:F
>> >
>> > And save one scratch read while still preserving the benefits of
>> > spilling the register.
>> >
>> > In general, we avoid emitting scratch reads for as long as the next instruction
>> > keeps reading the spilled register. This should not harm the benefit of
>> > spilling the register because gains for register allocation only come when we
>> > have chunks of program code where the register is alive but not really used
>> > (because these are the points where we could effectively use that register for
>> > another purpose if we spilled it), so as long as consecutive instructions use
>> > that register we can avoid the scratch reads without losing anything.
>> > ---
>> >  .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     | 37 +++++++++++++++++++++-
>> >  1 file changed, 36 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > index cff5406..fd56dae 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
>> > @@ -340,11 +340,43 @@ vec4_visitor::spill_reg(int spill_reg_nr)
>> >     unsigned int spill_offset = last_scratch++;
>> >  
>> >     /* Generate spill/unspill instructions for the objects being spilled. */
>> > +   vec4_instruction *spill_write_inst = NULL;
>> >     foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
>> > +      /* We don't spill registers used for scratch */
>> > +      if (inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
>> > +          inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE)
>> > +         continue;
>> > +
>> >        int scratch_reg = -1;
>> > +      bool spill_reg_was_read = false;
>> >        for (unsigned int i = 0; i < 3; i++) {
>> >           if (inst->src[i].file == GRF && inst->src[i].reg == spill_reg_nr) {
>> > -            if (scratch_reg == -1) {
>> > +            if (!spill_reg_was_read) {
>> > +               spill_reg_was_read = (!inst->predicate ||
>> > +                                     inst->opcode == BRW_OPCODE_SEL);
>> > +            }
>> > +
>> > +            /* If we are reading the spilled register right after writing
>> > +             * to it we can skip the scratch read and use directly the
>> > +             * register we used as source for the scratch write. For this
>> > +             * to work we must check that:
>> > +             *
>> > +             * 1) The write is inconditional, that is, it is not predicated or
>> > +             *    it is a SEL.
>> > +             * 2) All the channels that we read have been written in that
>> > +             *    last write instruction.
>> > +             *
>> > +             * We keep doing this for as long as the next instruction
>> > +             * keeps reading the spilled register and break as soon as we
>> > +             * find an instruction that doesn't.
>> > +             */
>> > +            if (spill_write_inst &&
>> > +                (!spill_write_inst->predicate ||
>> > +                 spill_write_inst->opcode == BRW_OPCODE_SEL) &&
>> > +                ((brw_mask_for_swizzle(inst->src[i].swizzle) &
>> > +                 ~spill_write_inst->dst.writemask) == 0)) {
>> > +               scratch_reg = spill_write_inst->dst.reg;
>> > +            } else if (scratch_reg == -1) {
>> 
>> One suggestion: You could factor out the rather complex caching logic
>> into a separate function (e.g. 'bool can_reuse_scratch_for_source(const
>> vec4_instruction *, unsigned i, unsigned scratch_reg)').  The function
>> would simply compare scratch_reg with the sources of the current
>> instruction (up to src) and the sources and destination of the previous
>> non-scratch_read/write instruction.  If there's a match it would check
>> that the regioning is compatible with the i-th source and return true in
>> that case.  This would have several benefits:
>
> I think this might need to be a bit more complex. The previous inst's
> src[i] might read only a subset of the channels that where loaded into
> scratch_reg so comparing only against that can lead us to think that we
> can't reuse scratch_reg when in fact we can.
>
> I think the process should be more like we loop back looking at the prev
> instruction for as long as the previous instruction reads the
> scratch_ref until we find the one that writes to scratch_reg (which must
> be a scratch read). At that point we check that the writemask in that
> instruction is compatible with our swizzle on src[i], in which case we
> can reuse scratch_reg. Does this make sense to you?
>
The thing is that there may be no scratch read in the program yet if
you're being called from evaluate_spill_costs().  I think you want to
iterate for as long as the previous instruction reads scratch_reg but
doesn't write it -- If after the end of the loop you end up at an
instruction that writes, fine, you take the writemask from it, otherwise
assume XYZW.

Yes, we should probably change spill_reg() to read whole vec4s at once
when unspilling a register, reading a subset of the channels doesn't
save any memory bandwidth and makes the optimization you're implementing
less likely to succeed.

> In this case, if we have a successful match, we probably want to keep a
> reference to that instruction so that for the next src of the current
> instruction (or for the next instruction in our program) we don't have
> to repeat that process.
>

Keep in mind that the assumption behind this optimization is that a
given spill candidate is only re-used for a very short sequence of
consecutive instructions (otherwise it might in fact be harmful), so the
amount of CPU cycles you can save by doing that is likely to be tiny in
practice -- Feel free to do it if it's trivial but I wouldn't trade it
for any additional complexity in the spill_reg() and
evaluate_spill_costs() loops.

One last concern I had about this series has to do with the possibility
of it leading to an infinite loop in the allocate-spill loop -- I guess
that at least in theory it would be possible for a register used *only*
in a consecutive sequence of instructions to score quite high if the
sequence is part of a region of high register pressure.  If the register
allocator decides it's the best candidate we have a problem because this
heuristic will prevent us from making any progress.  I guess the easiest
way around this would be to have an array of counters in
evaluate_spill_costs() keeping track of the number of individual
instructions each register had to be (un)spilled for.  At the end of the
loop you'd check if it's less or equal to one and in that case mark it
as no_spill.  (Note that in theory this situation is possible even
without your changes AFAICT, just far less likely)

> Iago
>
>
>>  - It would keep ::spill_reg() simple and the caching heuristic factored
>>    out.  The only thing spill_reg() would still need to take care of is
>>    keep track of the last spilling temporary (e.g. as you're doing it
>>    now with the scratch_reg variable) and pass it to
>>    can_reuse_scratch_for_source(), but it would no longer be necessary
>>    to reset it to -1 when reuse is not possible.
>> 
>>  - It would allow you to use a single implementation of the caching
>>    policy to implement both spill_reg() and evaluate_spill_costs(), what
>>    would make sure that things don't go out of sync in the likely case
>>    that we want to change the caching heuristic in the future.
>> 
>> >                 scratch_reg = alloc.allocate(1);
>> >                 src_reg temp = inst->src[i];
>> >                 temp.reg = scratch_reg;
>> > @@ -357,6 +389,9 @@ vec4_visitor::spill_reg(int spill_reg_nr)
>> >  
>> >        if (inst->dst.file == GRF && inst->dst.reg == spill_reg_nr) {
>> >           emit_scratch_write(block, inst, spill_offset);
>> > +         spill_write_inst = inst;
>> > +      } else if (spill_write_inst && !spill_reg_was_read) {
>> > +         spill_write_inst = NULL;
>> >        }
>> >     }
>> >  
>> > -- 
>> > 1.9.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150731/216b9f9b/attachment.sig>


More information about the mesa-dev mailing list