[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Thu Sep 7 17:34:18 UTC 2017

On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <jmcasanova at igalia.com>
wrote:

> Hi Connor and Curro,
>
> On 28/08/17 12:24, Alejandro Piñeiro wrote:
> > On 27/08/17 20:24, Connor Abbott wrote:
> >> Hi,
> >>
> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com
> >> <mailto:apinheiro at igalia.com>> wrote:
> >>
> >>     On 24/08/17 21:07, Connor Abbott wrote:
> >>     >
> >>     > Hi Alejandro,
> >>
> >>     Hi Connor,
> >>
> >>     >
> >>     > This seems really suspicious. If the live ranges are really
> >>     > independent, then the register allocator should be able to
> >>     assign the
> >>     > two virtual registers to the same physical register if it needs
> to.
> >>
> >>     Yes, it is true, the register allocator should be able to assign two
> >>     virtual registers to the same physical register. But that is done
> >>     at the
> >>     end (or really near the end), so late for the problem this
> >>     optimization
> >>     is trying to fix.
> >>
> >>
> >> Well, my understanding is that the problem is long compilation times
> >> due to spilling and our not-so-great implementation of it. So no,
> >> register allocation is not late for the problem. As both Curro and I
> >> explained, the change by itself can only pessimise register
> >> allocation, so if it helps then it must be due to a bug in the
> >> register allocator or a problem in a subsequent pass that's getting
> >> hidden by this one.
> >
> > Ok.
> >
> >>
> >>     We are also reducing the amount of instructions used.
> >>
> >>
> >> The comments in the source code say otherwise. Any instructions
> >> eliminated were from spilling, which this pass only accidentally
> reduces.
> >
> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
> > remove any instructions. But using it reduces the final number of
> > instructions, although as you say, they are likely due reducing the
> > spilling.
> >
> >>
> >>
> >>
> >>     Probably not really clear on the commit message. When I say
> >>     "reduce the
> >>     pressure of the register allocator" I mean having a code that the
> >>     register allocator would be able to handle without using too much
> >>     time.
> >>     The problem this optimization tries to solve is that for some 16
> >>     bit CTS
> >>     tests (some with matrices and geometry shaders), the amount of
> virtual
> >>     registers used and instructions was really big. For the record,
> >>     initially, some tests needed 24 min just to compile. Right now,
> thanks
> >>     to other optimizations, the slower test without this optimization
> >>     needs
> >>     1min 30 seconds. Adding some hacky timestamps, the time used  at
> >>     fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
> >>
> >>     * While trying to schedule using the three available pre mode
> >>     heuristics: 7 seconds
> >>     * Allocation with spilling: 63 seconds
> >>     * Final schedule using SCHEDULE_POST: 19 seconds
> >>
> >>     With this optimization, the total time goes down to 14 seconds (10
> >>     + 0 +
> >>     3 on the previous bullet point list).
> >>
> >>     One could argue that 1min 30 seconds is okish. But taking into
> account
> >>     that it goes down to 14 seconds, even with some caveats (see
> below), I
> >>     still think that it is worth to use the optimization.
> >>
> >>     And a final comment. For that same test, this is the final stats
> >>     (using
> >>     INTEL_DEBUG):
> >>
> >>      * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> >>     130320 cycles. 15:9 spills:fills.
> >>      * Without the optimization: SIMD8 shader: 12312 instructions. 0
> >>     loops.
> >>     174816 cycles. 751:1851 spills:fills.
> >>
> >>
> >> So, the fact that it helps at all with SIMD8 shows that my theory is
> >> wrong, but since your pass reduces spilling, it clearly must be
> >> avoiding a bug somewhere else. You need to compare the IR for a shader
> >> with the problem with and without this pass right before register
> >> allocation. Maybe the sources and destinations of the conversion
> >> instructions interfere without the change due to some other pass
> >> that's increasing register pressure, in which case that's the problem,
> >> but I doubt it.
> >
> > Ok, thanks for the hints.
>
> After some research we found that we need to adapt the live_variables
> algorithm to support 32 to 16-bit conversions. Because of the HW
> alignment restrictions these conversions need that the result register
> uses stride=2, so it is not continuous (stride!=1) so by definition
> is_partial_write returns true. Any of the next last 3 conditions could
> be true when we use 16-bit types.
>
> bool
> fs_inst::is_partial_write() const
> {
>    return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
>            (this->exec_size * type_sz(this->dst.type)) < 32 ||
>            !this->dst.is_contiguous() ||
>            this->dst.offset % REG_SIZE != 0);
> }
>
> So at the check on the setup_one_write function at
> brw_fs_live_variables.cpp the variable isn't marked as defined
> completely in the block.
>
>    if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>       if (!BITSET_TEST(bd->use, var))
>          BITSET_SET(bd->def, var);
>    }
>
> That makes that the live start of the variable is expected to defined
> before the beginning of the block. In the commented test by Alejandro we
> have a big unrolled loop that increases the pressure of the register
> artificially making all result registers of a conversion to start life
> at instruction 0.
>

Quick drive-by comment.  Curro and I talked about this quite a bit in the
office yesterday as I've been hitting similar issues with the vec2->double
conversion that we do when loading 64-bit values from UBOs and SSBOs.  He's
got a patch to liveness which makes registers which are only ever partially
written not extned their live ranges infinitely upward.  That's probably
easier than trying to track things per-component.

--Jason

> A simpler approach to the submitted optimization would be to consider as
> a fully defined variable in a block the result of a 32-16 bit
> conversion. This way the register allocator has more freedom to assign
> the register because the live_interval is correct for this variable.
> Something like the following code:
>
> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
> b/src/intel/compiler/brw_fs_live_variables.cpp
> index c449672a51..98d8583eaa 100644
> --- a/src/intel/compiler/brw_fs_live_variables.cpp
> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
> @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data
> *bd, fs_inst *inst,
>     /* The def[] bitset marks when an initialization in a block completely
>      * screens off previous updates of that variable (VGRF channel).
>      */
> -   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
> +   if (inst->dst.file == VGRF && (!inst->is_partial_write() ||
> +                                  ((type_sz(inst->dst.type) == 2) &&
> +                                   (inst->opcode == BRW_OPCODE_MOV) &&
> +                                   (type_sz(inst->src[0].type) == 4)))) {
>        if (!BITSET_TEST(bd->use, var))
>           BITSET_SET(bd->def, var);
>     }
>
> This new code reduces the spills from 1217 to 1089 in our worse tests
> compared to using the submitted optimization and without regressions.
>
> Although this is a simple fix that address the problem with the
> conversions (that is the main use that 16bit arithmetic of the
> VK_KHR_16bit_storage extension), there are other cases that we didn't
> addressed with the reuse_16bit_conversions_register that need a more
> complex approach.
>
> I'm thinking about the cases when 16-bit types are packed/unpacked or
> shuffled/unshuffled in the storage operations. They are also
> partial_writes but the aggregation of several operations make that the
> register could be considered as completely defined in a block.
>
> - This would need doing a "subregister" tracking for stride=2 and
> offset=[0,2] for shuffled components that are written using 32-bit
> storage operations.
> - And the same for packet 16-bit values on SIMD8 that use half register
> continuously.
>
> So if only half of the register is defined for a given variable and only
> that half is read by the shader we can consider consistent that the
> variable is completely defined. But in the case that the different
> halves are defined in different blocks things get complicated but could
> stay as partial write.
>
> Does it makes sense to go deeper with an approach like this one? Maybe
> is not strictly needed for this series but it could be an interesting
> follow up for adjusting the liveness of the payloads of the 16-bit store
> operations and any future use of 16-bit types.
>
> Thanks in advance for your feedback.
>
> Chema
>
> >> (IIRC there's a debug option to print out the register pressure for
> >> each instruction, which will be helpful here).
> >
> > I tried to use that option back when I was working on this patch,
> > without too much luck. Will try again.
> >
> >> If that's not the case, you should check to see if the register
> >> allocator thinks the sources and destinations of the conversion
> >> instructions interfere, and if so figure out why - that will likely be
> >> the real bug.
> >
> > Ok.
> >
> > Thanks Connor and Curro for the comments. I will work on a different
> > solution.
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170907/d5af9e6b/attachment.html>