[Mesa-dev] [PATCH v2 3/6] i965/vec4: Don't emit scratch reads for a spilled register we have just written
Iago Toral
itoral at igalia.com
Fri Jul 31 03:36:52 PDT 2015
On Fri, 2015-07-31 at 13:12 +0300, Francisco Jerez wrote:
> Iago Toral <itoral at igalia.com> writes:
>
> > On Thu, 2015-07-30 at 17:08 +0300, Francisco Jerez wrote:
> >> Iago Toral Quiroga <itoral at igalia.com> writes:
> >>
> >> > When we have code such as this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> > ...
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> >
> >> > And vgrf1 is chosen for spilling, we can emit this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
> >> > mov vgrf3.0.x:F, vgrf1.xxxx:F
> >> > ...
> >> > gen4_scratch_read vgrf4.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
> >> >
> >> > Instead of this:
> >> >
> >> > mov vgrf1.0.x:F, vgrf2.xxxx:F
> >> > gen4_scratch_write hw_reg0:F, vgrf1.xxxx:D, 22D
> >> > gen4_scratch_read vgrf4.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf4.xxxx:F
> >> > ...
> >> > gen4_scratch_read vgrf5.0.x:F, 22D
> >> > mov vgrf3.0.x:F, vgrf5.xxxx:F
> >> >
> >> > And save one scratch read while still preserving the benefits of
> >> > spilling the register.
> >> >
> >> > In general, we avoid emitting scratch reads for as long as the next instruction
> >> > keeps reading the spilled register. This should not harm the benefit of
> >> > spilling the register because gains for register allocation only come when we
> >> > have chunks of program code where the register is alive but not really used
> >> > (because these are the points where we could effectively use that register for
> >> > another purpose if we spilled it), so as long as consecutive instructions use
> >> > that register we can avoid the scratch reads without losing anything.
> >> > ---
> >> > .../drivers/dri/i965/brw_vec4_reg_allocate.cpp | 37 +++++++++++++++++++++-
> >> > 1 file changed, 36 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > index cff5406..fd56dae 100644
> >> > --- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
> >> > @@ -340,11 +340,43 @@ vec4_visitor::spill_reg(int spill_reg_nr)
> >> > unsigned int spill_offset = last_scratch++;
> >> >
> >> > /* Generate spill/unspill instructions for the objects being spilled. */
> >> > + vec4_instruction *spill_write_inst = NULL;
> >> > foreach_block_and_inst(block, vec4_instruction, inst, cfg) {
> >> > + /* We don't spill registers used for scratch */
> >> > + if (inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
> >> > + inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE)
> >> > + continue;
> >> > +
> >> > int scratch_reg = -1;
> >> > + bool spill_reg_was_read = false;
> >> > for (unsigned int i = 0; i < 3; i++) {
> >> > if (inst->src[i].file == GRF && inst->src[i].reg == spill_reg_nr) {
> >> > - if (scratch_reg == -1) {
> >> > + if (!spill_reg_was_read) {
> >> > + spill_reg_was_read = (!inst->predicate ||
> >> > + inst->opcode == BRW_OPCODE_SEL);
> >> > + }
> >> > +
> >> > + /* If we are reading the spilled register right after writing
> >> > + * to it we can skip the scratch read and use directly the
> >> > + * register we used as source for the scratch write. For this
> >> > + * to work we must check that:
> >> > + *
> >> > + * 1) The write is inconditional, that is, it is not predicated or
> >> > + * it is a SEL.
> >> > + * 2) All the channels that we read have been written in that
> >> > + * last write instruction.
> >> > + *
> >> > + * We keep doing this for as long as the next instruction
> >> > + * keeps reading the spilled register and break as soon as we
> >> > + * find an instruction that doesn't.
> >> > + */
> >> > + if (spill_write_inst &&
> >> > + (!spill_write_inst->predicate ||
> >> > + spill_write_inst->opcode == BRW_OPCODE_SEL) &&
> >> > + ((brw_mask_for_swizzle(inst->src[i].swizzle) &
> >> > + ~spill_write_inst->dst.writemask) == 0)) {
> >> > + scratch_reg = spill_write_inst->dst.reg;
> >> > + } else if (scratch_reg == -1) {
> >>
> >> One suggestion: You could factor out the rather complex caching logic
> >> into a separate function (e.g. 'bool can_reuse_scratch_for_source(const
> >> vec4_instruction *, unsigned i, unsigned scratch_reg)'). The function
> >> would simply compare scratch_reg with the sources of the current
> >> instruction (up to src) and the sources and destination of the previous
> >> non-scratch_read/write instruction. If there's a match it would check
> >> that the regioning is compatible with the i-th source and return true in
> >> that case. This would have several benefits:
> >
> > I think this might need to be a bit more complex. The previous inst's
> > src[i] might read only a subset of the channels that where loaded into
> > scratch_reg so comparing only against that can lead us to think that we
> > can't reuse scratch_reg when in fact we can.
> >
> > I think the process should be more like we loop back looking at the prev
> > instruction for as long as the previous instruction reads the
> > scratch_ref until we find the one that writes to scratch_reg (which must
> > be a scratch read). At that point we check that the writemask in that
> > instruction is compatible with our swizzle on src[i], in which case we
> > can reuse scratch_reg. Does this make sense to you?
> >
> The thing is that there may be no scratch read in the program yet if
> you're being called from evaluate_spill_costs(). I think you want to
> iterate for as long as the previous instruction reads scratch_reg but
> doesn't write it -- If after the end of the loop you end up at an
> instruction that writes, fine, you take the writemask from it, otherwise
> assume XYZW.
Just to be clear, we can't assume XYZW unless we do the change you
suggest below so I understand that your suggestion includes changing our
scratch reads to always read the entire vec4.
The alternative would be to assume that only the union of the
scratch_reg channels read by the previous instructions are available in
scratch_reg and compare our read mask with that, but as you say, that
would make the optimization less successful in general.
> Yes, we should probably change spill_reg() to read whole vec4s at once
> when unspilling a register, reading a subset of the channels doesn't
> save any memory bandwidth and makes the optimization you're implementing
> less likely to succeed.
>
> > In this case, if we have a successful match, we probably want to keep a
> > reference to that instruction so that for the next src of the current
> > instruction (or for the next instruction in our program) we don't have
> > to repeat that process.
> >
>
> Keep in mind that the assumption behind this optimization is that a
> given spill candidate is only re-used for a very short sequence of
> consecutive instructions (otherwise it might in fact be harmful), so the
> amount of CPU cycles you can save by doing that is likely to be tiny in
> practice -- Feel free to do it if it's trivial but I wouldn't trade it
> for any additional complexity in the spill_reg() and
> evaluate_spill_costs() loops.
>
> One last concern I had about this series has to do with the possibility
> of it leading to an infinite loop in the allocate-spill loop -- I guess
> that at least in theory it would be possible for a register used *only*
> in a consecutive sequence of instructions to score quite high if the
> sequence is part of a region of high register pressure. If the register
> allocator decides it's the best candidate we have a problem because this
> heuristic will prevent us from making any progress. I guess the easiest
> way around this would be to have an array of counters in
> evaluate_spill_costs() keeping track of the number of individual
> instructions each register had to be (un)spilled for. At the end of the
> loop you'd check if it's less or equal to one and in that case mark it
> as no_spill. (Note that in theory this situation is possible even
> without your changes AFAICT, just far less likely)
>
> > Iago
> >
> >
> >> - It would keep ::spill_reg() simple and the caching heuristic factored
> >> out. The only thing spill_reg() would still need to take care of is
> >> keep track of the last spilling temporary (e.g. as you're doing it
> >> now with the scratch_reg variable) and pass it to
> >> can_reuse_scratch_for_source(), but it would no longer be necessary
> >> to reset it to -1 when reuse is not possible.
> >>
> >> - It would allow you to use a single implementation of the caching
> >> policy to implement both spill_reg() and evaluate_spill_costs(), what
> >> would make sure that things don't go out of sync in the likely case
> >> that we want to change the caching heuristic in the future.
> >>
> >> > scratch_reg = alloc.allocate(1);
> >> > src_reg temp = inst->src[i];
> >> > temp.reg = scratch_reg;
> >> > @@ -357,6 +389,9 @@ vec4_visitor::spill_reg(int spill_reg_nr)
> >> >
> >> > if (inst->dst.file == GRF && inst->dst.reg == spill_reg_nr) {
> >> > emit_scratch_write(block, inst, spill_offset);
> >> > + spill_write_inst = inst;
> >> > + } else if (spill_write_inst && !spill_reg_was_read) {
> >> > + spill_write_inst = NULL;
> >> > }
> >> > }
> >> >
> >> > --
> >> > 1.9.1
More information about the mesa-dev
mailing list