[Mesa-dev] [PATCH 7/9] radeonsi: add support for easy opcodes from ARB_gpu_shader5
Marek Olšák
maraeo at gmail.com
Tue Mar 10 17:27:32 PDT 2015
On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <tom at stellard.net> wrote:
> On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
>> I've looked into how to recognize BFM and BFI and discovered that if
>> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
>> pattern in the backend due to LLVM transformations. The reason it's
>> impossible is that one particular simplification of the expanded IR
>> can always be done and it always changes the IR in a way that BFI
>> can't be recognized anymore.
>>
>> The ideal transformation from TGSI to ISA is (note that this is also
>> how GLSL defines the opcode):
>>
>> TGSI_OPCODE_BFI(base, insert, offset, bits)
>> = BFI(BFM(bits, offset), SHL(insert, offset), base) =
>> s_lshl_b32 s1, s4, s6
>> s_bfm_b32 s0, s0, s6
>> v_mov_b32_e32 v0, s5
>> v_mov_b32_e32 v1, s1
>> v_bfi_b32 v0, s0, v1, v0
>> Ideally 3 instructions if all sources are vector registers.
>>
>> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
>> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
>> transformation:
>> (X << S) & (Y << S) ----> (X & Y) << S
>> Which breaks recognition of BFI and also the second use of BFM.
>> Therefore this version calculates the same BFM expression twice. The
>> first BFM is recognized by pattern matching, but the second BFM as
>> well as BFI is unrecognizable due to the transformation. The result
>> is:
>> s_lshl_b32 s1, 1, s0
>> s_bfm_b32 s0, s0, s6
>> s_add_i32 s1, s1, -1
>> s_not_b32 s0, s0
>> s_and_b32 s1, s1, s4
>> s_and_b32 s0, s0, s5
>> s_lshl_b32 s1, s1, s6
>> s_or_b32 s0, s1, s0
>>
>> There are 2 ways out of this:
>>
>> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
>> the future.
>>
>> 2) Try to recognize the expression tree seen by the backend. Changes
>> in LLVM core can break it. More complicated shaders with more
>> opportunities for transformations can break it too:
>>
>> def : Pat <
>> (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
>> -1)), i32:$b)), i32:$c)),
>> (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
>> -1)), i32:$c)), -1)), i32:$d)))),
>> (V_BFI_B32 (S_BFM_B32 $a, $c),
>> (S_LSHL_B32 $b, $c),
>> $d)
>> >;
>
> I don't want to waste a lot of time discussing this, because it probably
> doesn't matter too much in the long run. I'm fine with using
> intrinsics, but I just wanted to point out a few things in case you or
> someone else wants to get this working using LLVM IR.
>
> 1. Running the instruction combining pass should help with pattern
> matching. This transforms common sequence into canonical forms which
> make them easier to match in the backend. We should be running this
> pass anyway as it has some good optimization.
>
>
> diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> index dce5b55..45c9eb8 100644
> --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct
> radeon_llvm_context * ctx)
> LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
>
> /* Add some optimization passes */
> + LLVMAddInstructionCombiningPass(gallivm->passmgr);
> LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
> LLVMAddLICMPass(gallivm->passmgr);
> LLVMAddAggressiveDCEPass(gallivm->passmgr);
I tried this a long time ago and it broke a few tests which used the
kill intrinsic.
I'm testing it right now and it increases the shader binary size quite
a lot. It looks like some DAG combines don't work with it anymore and
the generated code looks worse overall.
I occasionally use llc for debugging, which doesn't seem to use it
either? Anyway, it looks like there's a lot of work needed just to fix
the worse code generation. And when Mesa starts to use it, llc should
use it too.
>
>
>
>
> 2. We have a number of patterns for BFI already in AMDGPUInstructions.td
> If we are modeling BFI using LLVM IR, we should be using one of those
> patterns.
All of them are useless for TGSI_OPCODE_BFI.
>
> 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
> pattern. There should be one pattern for BFI, one for BFM, and one for
> LSHL.
This won't work for TGSI, which I tried to explain in detail. Matching
the whole pattern is the only way. The big pattern can recognize a
hidden duplicated expression in it which the CSE can't. Any strict
subset of the pattern doesn't make any sense alone.
>
> 4. Using intrinsics means no constant folding will be done, which could
> lead to worse code in some cases.
This is not a big issue. The GLSL compiler understands these
instrinsics and takes care of constant folding.
Marek
More information about the mesa-dev
mailing list