[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Wed Sep 6 22:58:55 UTC 2017

Hi Connor and Curro,

On 28/08/17 12:24, Alejandro Piñeiro wrote:
> On 27/08/17 20:24, Connor Abbott wrote:
>> Hi,
>>
>> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com
>> <mailto:apinheiro at igalia.com>> wrote:
>>
>>     On 24/08/17 21:07, Connor Abbott wrote:
>>     >
>>     > Hi Alejandro,
>>
>>     Hi Connor,
>>
>>     >
>>     > This seems really suspicious. If the live ranges are really
>>     > independent, then the register allocator should be able to
>>     assign the
>>     > two virtual registers to the same physical register if it needs to.
>>
>>     Yes, it is true, the register allocator should be able to assign two
>>     virtual registers to the same physical register. But that is done
>>     at the
>>     end (or really near the end), so late for the problem this
>>     optimization
>>     is trying to fix.
>>
>>
>> Well, my understanding is that the problem is long compilation times
>> due to spilling and our not-so-great implementation of it. So no,
>> register allocation is not late for the problem. As both Curro and I
>> explained, the change by itself can only pessimise register
>> allocation, so if it helps then it must be due to a bug in the
>> register allocator or a problem in a subsequent pass that's getting
>> hidden by this one.
> 
> Ok.
> 
>>
>>     We are also reducing the amount of instructions used.
>>
>>
>> The comments in the source code say otherwise. Any instructions
>> eliminated were from spilling, which this pass only accidentally reduces.
> 
> Yes, sorry, I explained myself poorly. The optimization itself doesn't
> remove any instructions. But using it reduces the final number of
> instructions, although as you say, they are likely due reducing the
> spilling.
> 
>>
>>
>>
>>     Probably not really clear on the commit message. When I say
>>     "reduce the
>>     pressure of the register allocator" I mean having a code that the
>>     register allocator would be able to handle without using too much
>>     time.
>>     The problem this optimization tries to solve is that for some 16
>>     bit CTS
>>     tests (some with matrices and geometry shaders), the amount of virtual
>>     registers used and instructions was really big. For the record,
>>     initially, some tests needed 24 min just to compile. Right now, thanks
>>     to other optimizations, the slower test without this optimization
>>     needs
>>     1min 30 seconds. Adding some hacky timestamps, the time used  at
>>     fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>>
>>     * While trying to schedule using the three available pre mode
>>     heuristics: 7 seconds
>>     * Allocation with spilling: 63 seconds
>>     * Final schedule using SCHEDULE_POST: 19 seconds
>>
>>     With this optimization, the total time goes down to 14 seconds (10
>>     + 0 +
>>     3 on the previous bullet point list).
>>
>>     One could argue that 1min 30 seconds is okish. But taking into account
>>     that it goes down to 14 seconds, even with some caveats (see below), I
>>     still think that it is worth to use the optimization.
>>
>>     And a final comment. For that same test, this is the final stats
>>     (using
>>     INTEL_DEBUG):
>>
>>      * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>>     130320 cycles. 15:9 spills:fills.
>>      * Without the optimization: SIMD8 shader: 12312 instructions. 0
>>     loops.
>>     174816 cycles. 751:1851 spills:fills.
>>
>>
>> So, the fact that it helps at all with SIMD8 shows that my theory is
>> wrong, but since your pass reduces spilling, it clearly must be
>> avoiding a bug somewhere else. You need to compare the IR for a shader
>> with the problem with and without this pass right before register
>> allocation. Maybe the sources and destinations of the conversion
>> instructions interfere without the change due to some other pass
>> that's increasing register pressure, in which case that's the problem,
>> but I doubt it.
> 
> Ok, thanks for the hints.

After some research we found that we need to adapt the live_variables
algorithm to support 32 to 16-bit conversions. Because of the HW
alignment restrictions these conversions need that the result register
uses stride=2, so it is not continuous (stride!=1) so by definition
is_partial_write returns true. Any of the next last 3 conditions could
be true when we use 16-bit types.

bool
fs_inst::is_partial_write() const
{
   return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
           (this->exec_size * type_sz(this->dst.type)) < 32 ||
           !this->dst.is_contiguous() ||
           this->dst.offset % REG_SIZE != 0);
}

So at the check on the setup_one_write function at
brw_fs_live_variables.cpp the variable isn't marked as defined
completely in the block.

   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
      if (!BITSET_TEST(bd->use, var))
         BITSET_SET(bd->def, var);
   }

That makes that the live start of the variable is expected to defined
before the beginning of the block. In the commented test by Alejandro we
have a big unrolled loop that increases the pressure of the register
artificially making all result registers of a conversion to start life
at instruction 0.

A simpler approach to the submitted optimization would be to consider as
a fully defined variable in a block the result of a 32-16 bit
conversion. This way the register allocator has more freedom to assign
the register because the live_interval is correct for this variable.
Something like the following code:

diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
b/src/intel/compiler/brw_fs_live_variables.cpp
index c449672a51..98d8583eaa 100644
--- a/src/intel/compiler/brw_fs_live_variables.cpp
+++ b/src/intel/compiler/brw_fs_live_variables.cpp
@@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data
*bd, fs_inst *inst,
    /* The def[] bitset marks when an initialization in a block completely
     * screens off previous updates of that variable (VGRF channel).
     */
-   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
+   if (inst->dst.file == VGRF && (!inst->is_partial_write() ||
+                                  ((type_sz(inst->dst.type) == 2) &&
+                                   (inst->opcode == BRW_OPCODE_MOV) &&
+                                   (type_sz(inst->src[0].type) == 4)))) {
       if (!BITSET_TEST(bd->use, var))
          BITSET_SET(bd->def, var);
    }

This new code reduces the spills from 1217 to 1089 in our worse tests
compared to using the submitted optimization and without regressions.

Although this is a simple fix that address the problem with the
conversions (that is the main use that 16bit arithmetic of the
VK_KHR_16bit_storage extension), there are other cases that we didn't
addressed with the reuse_16bit_conversions_register that need a more
complex approach.

I'm thinking about the cases when 16-bit types are packed/unpacked or
shuffled/unshuffled in the storage operations. They are also
partial_writes but the aggregation of several operations make that the
register could be considered as completely defined in a block.

- This would need doing a "subregister" tracking for stride=2 and
offset=[0,2] for shuffled components that are written using 32-bit
storage operations.
- And the same for packet 16-bit values on SIMD8 that use half register
continuously.

So if only half of the register is defined for a given variable and only
that half is read by the shader we can consider consistent that the
variable is completely defined. But in the case that the different
halves are defined in different blocks things get complicated but could
stay as partial write.

Does it makes sense to go deeper with an approach like this one? Maybe
is not strictly needed for this series but it could be an interesting
follow up for adjusting the liveness of the payloads of the 16-bit store
operations and any future use of 16-bit types.

Thanks in advance for your feedback.

Chema

>> (IIRC there's a debug option to print out the register pressure for
>> each instruction, which will be helpful here).
> 
> I tried to use that option back when I was working on this patch,
> without too much luck. Will try again.
> 
>> If that's not the case, you should check to see if the register
>> allocator thinks the sources and destinations of the conversion
>> instructions interfere, and if so figure out why - that will likely be
>> the real bug.
> 
> Ok.
> 
> Thanks Connor and Curro for the comments. I will work on a different
> solution.