[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization
Francisco Jerez
currojerez at riseup.net
Fri Sep 8 00:45:23 UTC 2017
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <jmcasanova at igalia.com>
> wrote:
>
>> Hi Connor and Curro,
>>
>> On 28/08/17 12:24, Alejandro Piñeiro wrote:
>> > On 27/08/17 20:24, Connor Abbott wrote:
>> >> Hi,
>> >>
>> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinheiro at igalia.com
>> >> <mailto:apinheiro at igalia.com>> wrote:
>> >>
>> >> On 24/08/17 21:07, Connor Abbott wrote:
>> >> >
>> >> > Hi Alejandro,
>> >>
>> >> Hi Connor,
>> >>
>> >> >
>> >> > This seems really suspicious. If the live ranges are really
>> >> > independent, then the register allocator should be able to
>> >> assign the
>> >> > two virtual registers to the same physical register if it needs
>> to.
>> >>
>> >> Yes, it is true, the register allocator should be able to assign two
>> >> virtual registers to the same physical register. But that is done
>> >> at the
>> >> end (or really near the end), so late for the problem this
>> >> optimization
>> >> is trying to fix.
>> >>
>> >>
>> >> Well, my understanding is that the problem is long compilation times
>> >> due to spilling and our not-so-great implementation of it. So no,
>> >> register allocation is not late for the problem. As both Curro and I
>> >> explained, the change by itself can only pessimise register
>> >> allocation, so if it helps then it must be due to a bug in the
>> >> register allocator or a problem in a subsequent pass that's getting
>> >> hidden by this one.
>> >
>> > Ok.
>> >
>> >>
>> >> We are also reducing the amount of instructions used.
>> >>
>> >>
>> >> The comments in the source code say otherwise. Any instructions
>> >> eliminated were from spilling, which this pass only accidentally
>> reduces.
>> >
>> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
>> > remove any instructions. But using it reduces the final number of
>> > instructions, although as you say, they are likely due reducing the
>> > spilling.
>> >
>> >>
>> >>
>> >>
>> >> Probably not really clear on the commit message. When I say
>> >> "reduce the
>> >> pressure of the register allocator" I mean having a code that the
>> >> register allocator would be able to handle without using too much
>> >> time.
>> >> The problem this optimization tries to solve is that for some 16
>> >> bit CTS
>> >> tests (some with matrices and geometry shaders), the amount of
>> virtual
>> >> registers used and instructions was really big. For the record,
>> >> initially, some tests needed 24 min just to compile. Right now,
>> thanks
>> >> to other optimizations, the slower test without this optimization
>> >> needs
>> >> 1min 30 seconds. Adding some hacky timestamps, the time used at
>> >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>> >>
>> >> * While trying to schedule using the three available pre mode
>> >> heuristics: 7 seconds
>> >> * Allocation with spilling: 63 seconds
>> >> * Final schedule using SCHEDULE_POST: 19 seconds
>> >>
>> >> With this optimization, the total time goes down to 14 seconds (10
>> >> + 0 +
>> >> 3 on the previous bullet point list).
>> >>
>> >> One could argue that 1min 30 seconds is okish. But taking into
>> account
>> >> that it goes down to 14 seconds, even with some caveats (see
>> below), I
>> >> still think that it is worth to use the optimization.
>> >>
>> >> And a final comment. For that same test, this is the final stats
>> >> (using
>> >> INTEL_DEBUG):
>> >>
>> >> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>> >> 130320 cycles. 15:9 spills:fills.
>> >> * Without the optimization: SIMD8 shader: 12312 instructions. 0
>> >> loops.
>> >> 174816 cycles. 751:1851 spills:fills.
>> >>
>> >>
>> >> So, the fact that it helps at all with SIMD8 shows that my theory is
>> >> wrong, but since your pass reduces spilling, it clearly must be
>> >> avoiding a bug somewhere else. You need to compare the IR for a shader
>> >> with the problem with and without this pass right before register
>> >> allocation. Maybe the sources and destinations of the conversion
>> >> instructions interfere without the change due to some other pass
>> >> that's increasing register pressure, in which case that's the problem,
>> >> but I doubt it.
>> >
>> > Ok, thanks for the hints.
>>
>> After some research we found that we need to adapt the live_variables
>> algorithm to support 32 to 16-bit conversions. Because of the HW
>> alignment restrictions these conversions need that the result register
>> uses stride=2, so it is not continuous (stride!=1) so by definition
>> is_partial_write returns true. Any of the next last 3 conditions could
>> be true when we use 16-bit types.
>>
>> bool
>> fs_inst::is_partial_write() const
>> {
>> return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
>> (this->exec_size * type_sz(this->dst.type)) < 32 ||
>> !this->dst.is_contiguous() ||
>> this->dst.offset % REG_SIZE != 0);
>> }
>>
>> So at the check on the setup_one_write function at
>> brw_fs_live_variables.cpp the variable isn't marked as defined
>> completely in the block.
>>
>> if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>> if (!BITSET_TEST(bd->use, var))
>> BITSET_SET(bd->def, var);
>> }
>>
>> That makes that the live start of the variable is expected to defined
>> before the beginning of the block. In the commented test by Alejandro we
>> have a big unrolled loop that increases the pressure of the register
>> artificially making all result registers of a conversion to start life
>> at instruction 0.
>>
>
> Quick drive-by comment. Curro and I talked about this quite a bit in the
> office yesterday as I've been hitting similar issues with the vec2->double
> conversion that we do when loading 64-bit values from UBOs and SSBOs. He's
> got a patch to liveness which makes registers which are only ever partially
> written not extned their live ranges infinitely upward. That's probably
> easier than trying to track things per-component.
>
FTR you can find my liveness analysis patch here:
https://lists.freedesktop.org/archives/mesa-dev/2017-September/168974.html
Chema/Alejandro, let me know if it doesn't have the intended effect on
shaders that use FP16, I haven't tested it in combination with your
series.
Thanks.
> --Jason
>
>
>> A simpler approach to the submitted optimization would be to consider as
>> a fully defined variable in a block the result of a 32-16 bit
>> conversion. This way the register allocator has more freedom to assign
>> the register because the live_interval is correct for this variable.
>> Something like the following code:
>>
>> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
>> b/src/intel/compiler/brw_fs_live_variables.cpp
>> index c449672a51..98d8583eaa 100644
>> --- a/src/intel/compiler/brw_fs_live_variables.cpp
>> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
>> @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data
>> *bd, fs_inst *inst,
>> /* The def[] bitset marks when an initialization in a block completely
>> * screens off previous updates of that variable (VGRF channel).
>> */
>> - if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>> + if (inst->dst.file == VGRF && (!inst->is_partial_write() ||
>> + ((type_sz(inst->dst.type) == 2) &&
>> + (inst->opcode == BRW_OPCODE_MOV) &&
>> + (type_sz(inst->src[0].type) == 4)))) {
>> if (!BITSET_TEST(bd->use, var))
>> BITSET_SET(bd->def, var);
>> }
>>
>> This new code reduces the spills from 1217 to 1089 in our worse tests
>> compared to using the submitted optimization and without regressions.
>>
>> Although this is a simple fix that address the problem with the
>> conversions (that is the main use that 16bit arithmetic of the
>> VK_KHR_16bit_storage extension), there are other cases that we didn't
>> addressed with the reuse_16bit_conversions_register that need a more
>> complex approach.
>>
>> I'm thinking about the cases when 16-bit types are packed/unpacked or
>> shuffled/unshuffled in the storage operations. They are also
>> partial_writes but the aggregation of several operations make that the
>> register could be considered as completely defined in a block.
>>
>> - This would need doing a "subregister" tracking for stride=2 and
>> offset=[0,2] for shuffled components that are written using 32-bit
>> storage operations.
>> - And the same for packet 16-bit values on SIMD8 that use half register
>> continuously.
>>
>> So if only half of the register is defined for a given variable and only
>> that half is read by the shader we can consider consistent that the
>> variable is completely defined. But in the case that the different
>> halves are defined in different blocks things get complicated but could
>> stay as partial write.
>>
>> Does it makes sense to go deeper with an approach like this one? Maybe
>> is not strictly needed for this series but it could be an interesting
>> follow up for adjusting the liveness of the payloads of the 16-bit store
>> operations and any future use of 16-bit types.
>>
>> Thanks in advance for your feedback.
>>
>> Chema
>>
>> >> (IIRC there's a debug option to print out the register pressure for
>> >> each instruction, which will be helpful here).
>> >
>> > I tried to use that option back when I was working on this patch,
>> > without too much luck. Will try again.
>> >
>> >> If that's not the case, you should check to see if the register
>> >> allocator thinks the sources and destinations of the conversion
>> >> instructions interfere, and if so figure out why - that will likely be
>> >> the real bug.
>> >
>> > Ok.
>> >
>> > Thanks Connor and Curro for the comments. I will work on a different
>> > solution.
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170907/aecbf73a/attachment.sig>
More information about the mesa-dev
mailing list