[Mesa-dev] [PATCH 0/5] Improvements to the vec4 spilling code

Fri Jul 24 04:31:47 PDT 2015

Hi,

I have been looking into the vec4 spilling code, and this series implements
a few improvements. The main changes are in patches 1 and 4, which add small
optimizations. The remaining patches are all minor changes.

Also, I noticed that enabling spilling of everything (which is what I used to
test these changes) makes additional piglit tests fail. I did not look into
why this happens, but this series seems to improve things slightly, probably
because it saves a few scratch loads in some cases. Specifically, I get these
results on my IvyBridge laptop for a full piglit run forcing spilling of
everything on the vec4 backend:

With master:
crash: 5, fail: 205, pass: 18630, skip: 9187, warn: 3 -

With this series:
crash: 5, fail: 191, pass: 18639, skip: 9192, warn: 3 -

Besides the changes implemented in this series, I also evaluated other ideas
based on initial work by Ben. However, I ended up discarding them because they
would not bring the benefits he was anticipating. I discuss the rationale
for each one below:

1) Reuse the same vgrf for all the scratch reads

The current code allocates a new vgrf every time the spilled register is read
by an instruction. Although that increases the vgrf count, it is actually the
key to making spilling successful. As far as I understand the register
allocation process, we need to spill when there are conflicts between live
vgrfs that can't be allocated simultaneously. These conflicts ultimately come
from the liveness analysis and, generally, the longer the life span of a vgrf,
the harder it is to allocate. Once we have decided to spill a register, we
turn it into multiple short-lived vgrfs. Because these registers are
short-lived, they are easy to allocate and we reduce the average number of
conflicts in the allocation process, taking us one step closer to success.

If, on the other hand, we reuse the same vgrf for all scratch reads of the
spilled register, we end up with a vgrf that has exactly the same life span
as the register we spilled and thus exactly the same conflicts. That is, we
end up in the same situation as before, only with one extra vgrf on top. It
is even worse if we try to allocate a single vgrf for all our spills, since
that just wouldn't work: as soon as we need to spill more than one operand
of the same instruction we would have a problem.
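
To illustrate the argument above, here is a minimal sketch of how live range
length relates to the number of conflicts. It is a toy model and not Mesa
code: the LiveRange struct, the interferes() and count_interferences()
helpers, and the idea of treating a live range as a single closed interval of
instruction indices are all assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Toy model (not Mesa code): a live range is the closed interval of
// instruction indices [start, end] over which a vgrf is considered live.
struct LiveRange {
    int start;
    int end;
};

// Two live ranges interfere when their intervals overlap.
static bool interferes(const LiveRange &a, const LiveRange &b)
{
    return a.start <= b.end && b.start <= a.end;
}

// Number of registers in 'others' that 'r' interferes with.
static int count_interferences(const LiveRange &r,
                               const std::vector<LiveRange> &others)
{
    assert(r.start <= r.end);
    int n = 0;
    for (const LiveRange &o : others) {
        if (interferes(r, o))
            n++;
    }
    return n;
}
```

For example, suppose other registers are live over [1,3], [4,6], [7,9] and
[2,8], and the spilled register is accessed at instructions 0, 5 and 10. A
single reused vgrf spanning [0,10] interferes with all four of them, while
per-access temporaries covering [0,0], [5,5] and [10,10] interfere with 0, 2
and 0 of them respectively.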

2) Allow spilling of registers with size > 1

I think this is useless in the vec4 backend because by the time we reach
register allocation we won't have registers with size > 1. This is because
GRF array access is pushed to scratch and then the split_virtual_grfs
pass splits anything that still has size > 1 into registers of size 1.

To my shame I only realized this after doing the changes and noticing that
a full piglit run never hit the case of registers with size > 1.

3) Making spilling costly for adjacent registers

The idea here was to increase the spilling cost for vgrfs that were written by
one instruction and immediately used in the next.

This has one obvious problem: it only considers two instructions. If the
same register is used again much later in the program, it could actually be
the source of a lot of conflicts (because it is alive for all that time) and
we do want to spill it. One of the patches in my series is based on this
idea, but what it does is skip the scratch read for that operand rather than
prevent the register from being spilled.
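
As a rough illustration of skipping the scratch read (rather than blocking
the spill), the sketch below counts scratch reads for a spilled register in a
toy instruction list. It is a simplification and not the code from the
patches: the Inst struct and both counting functions are hypothetical, and
the real conditions in the backend are likely stricter than "the previous
instruction wrote the register".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction (not Mesa code): one destination vgrf and a list of
// source vgrfs. A destination of -1 means "writes no vgrf".
struct Inst {
    int dst;
    std::vector<int> srcs;
};

// Naive scheme: one scratch read for every source operand that reads
// the spilled register.
static int count_scratch_reads_naive(const std::vector<Inst> &prog,
                                     int spilled)
{
    int reads = 0;
    for (const Inst &inst : prog) {
        for (int src : inst.srcs) {
            if (src == spilled)
                reads++;
        }
    }
    return reads;
}

// Sketch of the two optimizations: at most one scratch read per
// instruction, and none at all if the previous instruction has just
// written the spilled register (its value is assumed to be still
// available in a temporary).
static int count_scratch_reads_optimized(const std::vector<Inst> &prog,
                                         int spilled)
{
    int reads = 0;
    for (std::size_t i = 0; i < prog.size(); i++) {
        bool uses_spilled = false;
        for (int src : prog[i].srcs) {
            if (src == spilled)
                uses_spilled = true;
        }
        if (!uses_spilled)
            continue;

        bool just_written = i > 0 && prog[i - 1].dst == spilled;
        if (!just_written)
            reads++;
    }
    return reads;
}
```

Note that the register is still spilled in both cases; only the number of
scratch reads emitted for it changes.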

The second problem is that the current algorithm that selects the best
register to spill already considers this, but in a broader, more useful way.
The algorithm selects the vgrf with the best benefit / cost ratio. The benefit
is computed based on the number of interferences that the vgrf produces, so
for a short-lived register that is only written once and immediately used,
it will compute a very small benefit that makes it unlikely to be selected
for spilling (actually, if this is the best candidate we can spill, we will
fail to allocate anyway, since spilling a register like this won't get us
any closer to a successful allocation!). On the other hand, if that same
register is used again much later in the program, the algorithm will probably
compute a high benefit, since a long-lived register like that is likely to
cause a lot of interferences.

In summary, the current algorithm seems to handle this case more efficiently. 
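
For reference, the selection described above can be sketched roughly as
follows. This is only an illustration of picking the best benefit / cost
ratio: the SpillCandidate struct, the choose_spill_candidate() function and
the way benefit and cost are represented are assumptions for this example,
not the actual register allocator code.

```cpp
#include <cassert>
#include <vector>

// Toy spill candidate (not Mesa code).
struct SpillCandidate {
    int vgrf;       // register id
    float benefit;  // e.g. number of interferences the vgrf produces
    float cost;     // e.g. scratch reads/writes that spilling would add
};

// Returns the vgrf with the highest benefit / cost ratio, or -1 if no
// candidate has a positive ratio. Candidates with a non-positive cost
// are treated as unspillable in this sketch.
static int choose_spill_candidate(const std::vector<SpillCandidate> &cands)
{
    int best = -1;
    float best_ratio = 0.0f;
    for (const SpillCandidate &c : cands) {
        assert(c.benefit >= 0.0f);
        if (c.cost <= 0.0f)
            continue;
        float ratio = c.benefit / c.cost;
        if (ratio > best_ratio) {
            best_ratio = ratio;
            best = c.vgrf;
        }
    }
    return best;
}
```

With a short-lived register (benefit 1, cost 2) and a long-lived one that is
used again much later (benefit 12, cost 3), the long-lived register wins
without any special-casing of adjacent instructions.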

Iago Toral Quiroga (5):
  i965/vec4: Only emit one scratch read per instruction for spilled
    registers
  i965/vec4: Remove checks for reladdr when checking for spillable
    registers
  i965/vec4: Register spilling should never see registers with size != 1
  i965/vec4: Don't emit scratch reads for a spilled register we have
    just written
  i965: Add a debug option for spilling everything in vec4 code

 src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp  |  2 +-
 src/mesa/drivers/dri/i965/brw_vec4.cpp             |  2 +-
 .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     | 60 ++++++++++++++++++----
 src/mesa/drivers/dri/i965/intel_debug.c            |  3 +-
 src/mesa/drivers/dri/i965/intel_debug.h            |  5 +-
 5 files changed, 57 insertions(+), 15 deletions(-)

-- 
1.9.1