[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization
Jason Ekstrand
jason at jlekstrand.net
Thu Sep 7 17:34:18 UTC 2017
On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <jmcasanova at igalia.com>
wrote:
> Hi Connor and Curro,
>
> On 28/08/17 12:24, Alejandro Piñeiro wrote:
> > On 27/08/17 20:24, Connor Abbott wrote:
> >> Hi,
> >>
> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com
> >> <mailto:apinheiro at igalia.com>> wrote:
> >>
> >> On 24/08/17 21:07, Connor Abbott wrote:
> >> >
> >> > Hi Alejandro,
> >>
> >> Hi Connor,
> >>
> >> >
> >> > This seems really suspicious. If the live ranges are really
> >> > independent, then the register allocator should be able to
> >> assign the
> >> > two virtual registers to the same physical register if it needs
> to.
> >>
> >> Yes, it is true, the register allocator should be able to assign two
> >> virtual registers to the same physical register. But that is done
> >> at the
> >> end (or really near the end), so late for the problem this
> >> optimization
> >> is trying to fix.
> >>
> >>
> >> Well, my understanding is that the problem is long compilation times
> >> due to spilling and our not-so-great implementation of it. So no,
> >> register allocation is not late for the problem. As both Curro and I
> >> explained, the change by itself can only pessimise register
> >> allocation, so if it helps then it must be due to a bug in the
> >> register allocator or a problem in a subsequent pass that's getting
> >> hidden by this one.
> >
> > Ok.
> >
> >>
> >> We are also reducing the amount of instructions used.
> >>
> >>
> >> The comments in the source code say otherwise. Any instructions
> >> eliminated were from spilling, which this pass only accidentally
> reduces.
> >
> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
> > remove any instructions. But using it reduces the final number of
> > instructions, although as you say, they are likely due reducing the
> > spilling.
> >
> >>
> >>
> >>
> >> Probably not really clear on the commit message. When I say
> >> "reduce the
> >> pressure of the register allocator" I mean having a code that the
> >> register allocator would be able to handle without using too much
> >> time.
> >> The problem this optimization tries to solve is that for some 16
> >> bit CTS
> >> tests (some with matrices and geometry shaders), the amount of
> virtual
> >> registers used and instructions was really big. For the record,
> >> initially, some tests needed 24 min just to compile. Right now,
> thanks
> >> to other optimizations, the slower test without this optimization
> >> needs
> >> 1min 30 seconds. Adding some hacky timestamps, the time used at
> >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
> >>
> >> * While trying to schedule using the three available pre mode
> >> heuristics: 7 seconds
> >> * Allocation with spilling: 63 seconds
> >> * Final schedule using SCHEDULE_POST: 19 seconds
> >>
> >> With this optimization, the total time goes down to 14 seconds (10
> >> + 0 +
> >> 3 on the previous bullet point list).
> >>
> >> One could argue that 1min 30 seconds is okish. But taking into
> account
> >> that it goes down to 14 seconds, even with some caveats (see
> below), I
> >> still think that it is worth to use the optimization.
> >>
> >> And a final comment. For that same test, this is the final stats
> >> (using
> >> INTEL_DEBUG):
> >>
> >> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> >> 130320 cycles. 15:9 spills:fills.
> >> * Without the optimization: SIMD8 shader: 12312 instructions. 0
> >> loops.
> >> 174816 cycles. 751:1851 spills:fills.
> >>
> >>
> >> So, the fact that it helps at all with SIMD8 shows that my theory is
> >> wrong, but since your pass reduces spilling, it clearly must be
> >> avoiding a bug somewhere else. You need to compare the IR for a shader
> >> with the problem with and without this pass right before register
> >> allocation. Maybe the sources and destinations of the conversion
> >> instructions interfere without the change due to some other pass
> >> that's increasing register pressure, in which case that's the problem,
> >> but I doubt it.
> >
> > Ok, thanks for the hints.
>
> After some research we found that we need to adapt the live_variables
> algorithm to support 32 to 16-bit conversions. Because of the HW
> alignment restrictions these conversions need that the result register
> uses stride=2, so it is not continuous (stride!=1) so by definition
> is_partial_write returns true. Any of the next last 3 conditions could
> be true when we use 16-bit types.
>
> bool
> fs_inst::is_partial_write() const
> {
> return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
> (this->exec_size * type_sz(this->dst.type)) < 32 ||
> !this->dst.is_contiguous() ||
> this->dst.offset % REG_SIZE != 0);
> }
>
> So at the check on the setup_one_write function at
> brw_fs_live_variables.cpp the variable isn't marked as defined
> completely in the block.
>
> if (inst->dst.file == VGRF && !inst->is_partial_write()) {
> if (!BITSET_TEST(bd->use, var))
> BITSET_SET(bd->def, var);
> }
>
> That makes that the live start of the variable is expected to defined
> before the beginning of the block. In the commented test by Alejandro we
> have a big unrolled loop that increases the pressure of the register
> artificially making all result registers of a conversion to start life
> at instruction 0.
>
Quick drive-by comment. Curro and I talked about this quite a bit in the
office yesterday as I've been hitting similar issues with the vec2->double
conversion that we do when loading 64-bit values from UBOs and SSBOs. He's
got a patch to liveness which makes registers which are only ever partially
written not extned their live ranges infinitely upward. That's probably
easier than trying to track things per-component.
--Jason
> A simpler approach to the submitted optimization would be to consider as
> a fully defined variable in a block the result of a 32-16 bit
> conversion. This way the register allocator has more freedom to assign
> the register because the live_interval is correct for this variable.
> Something like the following code:
>
> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
> b/src/intel/compiler/brw_fs_live_variables.cpp
> index c449672a51..98d8583eaa 100644
> --- a/src/intel/compiler/brw_fs_live_variables.cpp
> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
> @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data
> *bd, fs_inst *inst,
> /* The def[] bitset marks when an initialization in a block completely
> * screens off previous updates of that variable (VGRF channel).
> */
> - if (inst->dst.file == VGRF && !inst->is_partial_write()) {
> + if (inst->dst.file == VGRF && (!inst->is_partial_write() ||
> + ((type_sz(inst->dst.type) == 2) &&
> + (inst->opcode == BRW_OPCODE_MOV) &&
> + (type_sz(inst->src[0].type) == 4)))) {
> if (!BITSET_TEST(bd->use, var))
> BITSET_SET(bd->def, var);
> }
>
> This new code reduces the spills from 1217 to 1089 in our worse tests
> compared to using the submitted optimization and without regressions.
>
> Although this is a simple fix that address the problem with the
> conversions (that is the main use that 16bit arithmetic of the
> VK_KHR_16bit_storage extension), there are other cases that we didn't
> addressed with the reuse_16bit_conversions_register that need a more
> complex approach.
>
> I'm thinking about the cases when 16-bit types are packed/unpacked or
> shuffled/unshuffled in the storage operations. They are also
> partial_writes but the aggregation of several operations make that the
> register could be considered as completely defined in a block.
>
> - This would need doing a "subregister" tracking for stride=2 and
> offset=[0,2] for shuffled components that are written using 32-bit
> storage operations.
> - And the same for packet 16-bit values on SIMD8 that use half register
> continuously.
>
> So if only half of the register is defined for a given variable and only
> that half is read by the shader we can consider consistent that the
> variable is completely defined. But in the case that the different
> halves are defined in different blocks things get complicated but could
> stay as partial write.
>
> Does it makes sense to go deeper with an approach like this one? Maybe
> is not strictly needed for this series but it could be an interesting
> follow up for adjusting the liveness of the payloads of the 16-bit store
> operations and any future use of 16-bit types.
>
> Thanks in advance for your feedback.
>
> Chema
>
> >> (IIRC there's a debug option to print out the register pressure for
> >> each instruction, which will be helpful here).
> >
> > I tried to use that option back when I was working on this patch,
> > without too much luck. Will try again.
> >
> >> If that's not the case, you should check to see if the register
> >> allocator thinks the sources and destinations of the conversion
> >> instructions interfere, and if so figure out why - that will likely be
> >> the real bug.
> >
> > Ok.
> >
> > Thanks Connor and Curro for the comments. I will work on a different
> > solution.
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170907/d5af9e6b/attachment.html>
More information about the mesa-dev
mailing list