[Mesa-dev] i965: Spilling non-contiguous registers

Tue Mar 8 12:48:58 UTC 2016

Hi,

I am trying to improve register spilling for fp64 programs, specifically
for the varying-packing-simple piglit test with double types. Because
this test uses all available varying slots, register pressure is
significant and spilling is necessary for it to pass, even for non-fp64
types.

The main obstacle to making this work with fp64 types is that the current
register spilling code refuses to spill registers that are not contiguous
(stride != 1), and fp64 needs to use a stride of 2 all the time. According
to the comment in brw_fs_reg_allocate.cpp, this restriction exists to avoid
generating bad assembly for smeared registers (stride = 0), so in theory
we should be able to just change the condition to only disallow spilling
of registers with stride 0.
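
To make the intended change concrete, here is a minimal sketch of the two
rules. The function names are hypothetical and this is not the actual code
in brw_fs_reg_allocate.cpp; it only models the stride check described
above.

```cpp
// Sketch only: `stride` stands in for the stride of a register access.
// Both function names are hypothetical and do not exist in Mesa.

// Current rule as described above: only contiguous registers may be spilled.
constexpr bool can_spill_current(unsigned stride)
{
   return stride == 1;
}

// Proposed rule: only smeared registers (stride 0) are excluded, so the
// stride-2 accesses that fp64 needs become spill candidates.
constexpr bool can_spill_relaxed(unsigned stride)
{
   return stride != 0;
}

static_assert(!can_spill_current(2), "fp64 strides are rejected today");
static_assert(can_spill_relaxed(2), "fp64 strides are accepted after the change");
static_assert(!can_spill_relaxed(0), "smeared registers stay excluded");
```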

That works, mostly. Specifically, we no longer fail to compile and the
test passes for up to ~120 varyings. Up to this point, it seems the test
only needs to spill registers that are written with a stride of 2 and
read with a stride of 1, which seems to work fine. However, beyond that
point, we need to spill registers that are read with a stride of 2, and
that fails consistently. Here is a minimal sample that fails:

0:  add(8) vgrf1:D, g2<0>:D, 1d 
1:  mov(8) vgrf5:DF, vgrf1:D 
2:  gen4_scratch_write(8) (mlen: 2) null:F, vgrf5+0.0:F (offset = 0) 
3:  gen4_scratch_write(8) (mlen: 2) null:F, vgrf5+1.0:F (offset = 32) 
4:  mov(8) vgrf3+0.0:UD, g1:UD NoMask WE_all 
5:  mov(8) vgrf3+1.0:F, g3:F 
6:  mov(8) vgrf3+2.0:F, g4:F 
7:  mov(8) vgrf3+3.0:F, g5:F 
8:  mov(8) vgrf3+4.0:F, g6:F 
9:  gen7_scratch_read(8) vgrf6+0.0:F,  (offset = 0) 
10: gen7_scratch_read(8) vgrf6+1.0:F,  (offset = 32) 
11: mov(8) vgrf3+5.0:F, vgrf6<2>:F 
12: gen7_scratch_read(8) vgrf7+0.0:F,  (offset = 0) 
13: gen7_scratch_read(8) vgrf7+1.0:F,  (offset = 32) 
14: mov(8) vgrf3+6.0:F, vgrf7+0.4<2>:F 
15: gen8_urb_write_simd8(8) (mlen: 9) (null):UD, vgrf3:F 
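
As a sanity check on the scratch offsets in the dump above, this small
sketch computes how many registers a SIMD8 value covers and where each one
lands in scratch space. The helper names are hypothetical; the only
hardware fact assumed is that a GRF is 32 bytes wide.

```cpp
// Sketch only: names are hypothetical, not Mesa code.
constexpr unsigned REG_BYTES = 32;   // size of one GRF in bytes

// Number of whole registers covered by one SIMD value.
constexpr unsigned regs_per_value(unsigned exec_size, unsigned type_bytes)
{
   return (exec_size * type_bytes + REG_BYTES - 1) / REG_BYTES;
}

// Scratch offset in bytes of the i-th register of a spilled value.
constexpr unsigned scratch_offset(unsigned base, unsigned i)
{
   return base + i * REG_BYTES;
}

static_assert(regs_per_value(8, 4) == 1, "SIMD8 F fits in one register");
static_assert(regs_per_value(8, 8) == 2, "SIMD8 DF needs two registers");
static_assert(scratch_offset(0, 1) == 32, "second message lands at offset 32");
```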

Since DF values take twice as much space as F values, the spilling code
needs 2 writes and 2 reads for each spill/unspill. The final assembly for
the reads looks like this:

send(8)   g8<1>UW   g0<8,8,1>F
     data ( DC OWORD block read, 0, 0) mlen 1 rlen 1 { align1 1Q };
send(8)   g9<1>UW   g0<8,8,1>F
     data ( DC OWORD block read, 1, 0) mlen 1 rlen 1 { align1 1Q };
mov(8)    g122<1>F  g8<8,4,2>F                       { align1 1Q };
send(8)   g9<1>UW   g0<8,8,1>F
     data ( DC OWORD block read, 0, 0) mlen 1 rlen 1 { align1 1Q };
send(8)   g10<1>UW  g0<8,8,1>F
     data ( DC OWORD block read, 1, 0) mlen 1 rlen 1 { align1 1Q };
mov(8)    g123<1>F  g9.1<8,4,2>F                     { align1 1Q };
send(8)   null<1>F  g117<8,8,1>F
     urb 1 SIMD8 write mlen 9 rlen 0                 { align1 1Q EOT };
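
To check by hand which registers each strided MOV reads, the region
<vstride,width,hstride> can be expanded per channel. The sketch below uses
hypothetical helper names and assumes the subregister number printed in the
assembly is in units of the element type (so g9.1 with type F starts at
byte 4).

```cpp
// Sketch only: hypothetical helper types, not Mesa code.
constexpr unsigned GRF_BYTES = 32;   // size of one GRF in bytes

struct region {
   unsigned reg;      // base register number, e.g. 8 for g8
   unsigned subreg;   // subregister offset in elements, e.g. 1 for g9.1
   unsigned vstride;  // step between rows, in elements
   unsigned width;    // elements per row
   unsigned hstride;  // step between elements of a row
};

// Element index, relative to the start of the base register, that
// channel `ch` reads.
constexpr unsigned element(region r, unsigned ch)
{
   return r.subreg + (ch / r.width) * r.vstride + (ch % r.width) * r.hstride;
}

// GRF number that channel `ch` reads for a type of `type_bytes` bytes.
constexpr unsigned grf(region r, unsigned ch, unsigned type_bytes)
{
   return r.reg + element(r, ch) * type_bytes / GRF_BYTES;
}

constexpr region first  = {8, 0, 8, 4, 2};   // g8<8,4,2>F
constexpr region second = {9, 1, 8, 4, 2};   // g9.1<8,4,2>F

// Channels 0-3 read the first scratch read result, channels 4-7 the second.
static_assert(grf(first, 0, 4) == 8 && grf(first, 3, 4) == 8, "low half in g8");
static_assert(grf(first, 4, 4) == 9 && grf(first, 7, 4) == 9, "high half in g9");
static_assert(grf(second, 3, 4) == 9 && grf(second, 4, 4) == 10, "g9 then g10");
```

Under these assumptions, each strided MOV takes four channels from each of
the two registers written by the preceding pair of sends, so it does depend
on both scratch reads.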

All this looks correct to me, so I was wondering if the issue here could
be related to the hardware not seeing that it needs both scratch reads
to complete before it executes the MOV that reads vgrf6<2>:F. I have tried
various strategies to force a dependency between the strided read and
the results of both scratch reads (mostly adding code that uses the
results of both scratch reads before we do the strided MOV), but nothing
seems to make any difference, so the problem might have nothing to do
with this in the end. If this is not the problem, then I don't see what
could be causing something like this to fail...

Any ideas as to what could be going on here?

Thanks,
Iago