[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization
Francisco Jerez
currojerez at riseup.net
Fri Aug 25 18:35:54 UTC 2017
Alejandro Piñeiro <apinheiro at igalia.com> writes:
> On 24/08/17 21:07, Connor Abbott wrote:
>>
>> Hi Alejandro,
>
> Hi Connor,
>
>>
>> This seems really suspicious. If the live ranges are really
>> independent, then the register allocator should be able to assign the
>> two virtual registers to the same physical register if it needs to.
>
> Yes, it is true, the register allocator should be able to assign two
> virtual registers to the same physical register. But that is done at the
> end (or really near the end), so late for the problem this optimization
> is trying to fix. We are also reducing the amount of instructions used.
>
> Probably not really clear on the commit message. When I say "reduce the
> pressure of the register allocator" I mean having a code that the
> register allocator would be able to handle without using too much time.
> The problem this optimization tries to solve is that for some 16 bit CTS
> tests (some with matrices and geometry shaders), the amount of virtual
> registers used and instructions was really big. For the record,
> initially, some tests needed 24 min just to compile. Right now, thanks
> to other optimizations, the slower test without this optimization needs
> 1min 30 seconds. Adding some hacky timestamps, the time used at
> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
> * While trying to schedule using the three available pre mode
> heuristics: 7 seconds
> * Allocation with spilling: 63 seconds
> * Final schedule using SCHEDULE_POST: 19 seconds
>
> With this optimization, the total time goes down to 14 seconds (10 + 0 +
> 3 on the previous bullet point list).
>
> One could argue that 1min 30 seconds is okish. But taking into account
> that it goes down to 14 seconds, even with some caveats (see below), I
> still think that it is worth to use the optimization.
>
> And a final comment. For that same test, this is the final stats (using
> INTEL_DEBUG):
>
> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> 130320 cycles. 15:9 spills:fills.
> * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
> 174816 cycles. 751:1851 spills:fills.
>
>> This change forces the two to be the same, which constrains the
>> register allocator unecessarily and should make it worse, so I'm
>> confused as to why this would help at all.
>
> I didn't check that issue specifically, but I recently found that this
> optimization affects copy propagation/dead code eliminate. So there are
> still some room for improvement. But in any case, take into account that
> this custom optimization is only used if there is a 32 to 16 bit
> conversion, so only affects shaders with this specific feature.
>
>>
>> IIRC there were some issues where we unnecessarily made the sources
>> and destination of an instruction interefere with each other, but if
>> that's what's causing this, then we should fix that underlying issue.
>>
>> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
>> generator, in which case the second half of the source is read after
>> the first half of the destination is written, and we falsely thought
>> that the HW did that too, so we had some code to add a fake
>> interference between them, but a while ago Curro moved the expansion
>> to happen before register allocation. I don't have the code in front
>> of me, but I think we still have this useless code lying around, and I
>> would guess this is the source of the problem.)
>
> Taking into account what I explained before, I don't think that the
> problem is the interference or this code you mention (although perhaps
> Im wrong).
>
I agree with Connor's feed-back on this change, this really smells like
a hack working around register allocator brokenness. If the register
allocator is failing to assign two variables with disjoint live ranges
to the same register it has a bug. If you forcefully merge the live
ranges of source and destination it might turn out that that wasn't the
optimal decision to take after all register pressure-wise (because it's
frequently harder to find room in the GRF for a variable with 2x the
live range than for two independent variables), so you will be
pessimizing register usage in some cases -- The register allocator
should know better than you.
>> hhhh
>> Connor
>>
>> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <apinheiro at igalia.com
>> <mailto:apinheiro at igalia.com>> wrote:
>>
>> When dealing with HF/U/UW, it is usual having a register with a
>> F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
>> value. In those cases it would be possible to reuse the register where
>> the F value is initially stored instead of having two. Take also into
>> account that when operating with HF/U/UW, you would need to use the
>> full register (so stride 2). Packs/unpacks would be only useful when
>> loading/storing several HF/W/UW.
>>
>> Note that no instruction is removed. The main benefict is reducing the
>> amoung of registers used, so the pressure on the register allocator is
>> decreased with big shaders.
>>
>> Possibly this could be integrated into an existing optimization, at it
>> is even already done by the register allocator, but was far easier to
>> write and cleaner to read as a separate optimization.
>>
>> We found this issue when dealing with some Vulkan CTS tests that
>> needed several minutes to compile. Most of the time was spent on the
>> register allocator.
>>
>> Right now the optimization only handles 32 to 16 bit conversion. It
>> could be possible to do the equivalent for 16 to 32 bit too, but in
>> practice, we didn't need it.
>> ---
>> src/intel/compiler/brw_fs.cpp | 77
>> +++++++++++++++++++++++++++++++++++++++++++
>> src/intel/compiler/brw_fs.h | 1 +
>> 2 files changed, 78 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_fs.cpp
>> b/src/intel/compiler/brw_fs.cpp
>> index b6013a5ce85..1342150b44e 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -39,6 +39,7 @@
>> #include "compiler/glsl_types.h"
>> #include "compiler/nir/nir_builder.h"
>> #include "program/prog_parameter.h"
>> +#include "brw_fs_live_variables.h"
>>
>> using namespace brw;
>>
>> @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
>> return progress;
>> }
>>
>> +/**
>> + * When dealing with HF/W/UW, it is usual having a register with
>> a F/D/UD, and
>> + * then convert it to HF/W/UW, and not use again the F/D/UD
>> value. In those
>> + * cases it would be possible to reuse the register where the F
>> value is
>> + * initially stored instead of having two. Take also into account
>> that when
>> + * operating with HF/W/UW, you would need to use the full
>> register (so stride
>> + * 2). Packs/unpacks would be only useful when loading/storing
>> several
>> + * HF/W/UWs.
>> + *
>> + * So something like this:
>> + * mov(8) vgrf14<2>:HF, vgrf39:F
>> + *
>> + * Became:
>> + * mov(8) vgrf39<2>:HF, vgrf39:F
>> + *
>> + * Note that no instruction is removed. The main benefict is
>> reducing the
>> + * amoung of registers used, so the pressure on the register
>> allocator is
>> + * decreased with big shaders.
>> + */
>> +bool
>> +fs_visitor::reuse_16bit_conversions_vgrf()
>> +{
>> + bool progress = false;
>> + int ip = -1;
>> +
>> + calculate_live_intervals();
>> +
>> + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>> + ip++;
>> +
>> + if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
>> + continue;
>> +
>> + if (inst->opcode != BRW_OPCODE_MOV)
>> + continue;
>> +
>> + if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||
>> + type_sz(inst->src[0].type) != 4 || inst->src[0].stride
>> != 1) {
>> + continue;
>> + }
>> +
>> + int src_reg = inst->src[0].nr;
>> + int src_offset = inst->src[0].offset;
>> + unsigned src_var = live_intervals->var_from_vgrf[src_reg];
>> + int src_end = live_intervals->end[src_var];
>> + int dst_reg = inst->dst.nr <http://dst.nr>;
>> +
>> + if (src_end > ip)
>> + continue;
>> +
>> + foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
>> + if (scan_inst->dst.file == VGRF &&
>> + scan_inst->dst.nr <http://dst.nr> == dst_reg) {
>> + scan_inst->dst.nr <http://dst.nr> = src_reg;
>> + scan_inst->dst.offset = src_offset;
>> + progress = true;
>> + }
>> +
>> + for (int i = 0; i < scan_inst->sources; i++) {
>> + if (scan_inst->src[i].file == VGRF &&
>> + scan_inst->src[i].nr == dst_reg) {
>> + scan_inst->src[i].nr = src_reg;
>> + scan_inst->src[i].offset = src_offset;
>> + progress = true;
>> + }
>> + }
>> + }
>> + }
>> +
>> + if (progress)
>> + invalidate_live_intervals();
>> +
>> + return progress;
>> +}
>> +
>>
>> static void
>> clear_deps_for_inst_src(fs_inst *inst, bool *deps, int first_grf,
>> int grf_len)
>> @@ -5829,6 +5905,7 @@ fs_visitor::optimize()
>>
>> OPT(opt_drop_redundant_mov_to_flags);
>> OPT(remove_extra_rounding_modes);
>> + OPT(reuse_16bit_conversions_vgrf);
>>
>> do {
>> progress = false;
>> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
>> index b9476e69edb..2685f748b87 100644
>> --- a/src/intel/compiler/brw_fs.h
>> +++ b/src/intel/compiler/brw_fs.h
>> @@ -151,6 +151,7 @@ public:
>> bool dead_code_eliminate();
>> bool remove_duplicate_mrf_writes();
>> bool remove_extra_rounding_modes();
>> + bool reuse_16bit_conversions_vgrf();
>>
>> bool opt_sampler_eot();
>> bool virtual_grf_interferes(int a, int b);
>> --
>> 2.11.0
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org <mailto:mesa-dev at lists.freedesktop.org>
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>> <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>>
>>
>>
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170825/d33ec329/attachment.sig>
More information about the mesa-dev
mailing list