[Mesa-dev] [PATCH 7/9] radeonsi: add support for easy opcodes from ARB_gpu_shader5

Marek Olšák maraeo at gmail.com
Tue Mar 10 17:27:32 PDT 2015


On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <tom at stellard.net> wrote:
> On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
>> I've looked into how to recognize BFM and BFI and discovered that if
>> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
>> pattern in the backend due to LLVM transformations. The reason it's
>> impossible is that one particular simplification of the expanded IR
>> can always be done, and it always changes the IR in such a way that
>> BFI can no longer be recognized.
>>
>> The ideal transformation from TGSI to ISA is (note that this is also
>> how GLSL defines the opcode):
>>
>> TGSI_OPCODE_BFI(base, insert, offset, bits)
>> = BFI(BFM(bits, offset), SHL(insert, offset), base) =
>>     s_lshl_b32 s1, s4, s6
>>     s_bfm_b32 s0, s0, s6
>>     v_mov_b32_e32 v0, s5
>>     v_mov_b32_e32 v1, s1
>>     v_bfi_b32 v0, s0, v1, v0
>> Ideally 3 instructions if all sources are vector registers.
>>
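
For reference, here is a minimal C sketch of what the sequence above
computes, assuming the usual GCN semantics
s_bfm_b32(bits, offset) = ((1 << bits) - 1) << offset and
v_bfi_b32(mask, a, b) = (mask & a) | (~mask & b); the helper names are
made up for illustration:

    #include <stdint.h>

    static uint32_t bfm(uint32_t bits, uint32_t offset)
    {
        /* build a mask of 'bits' ones starting at bit 'offset' */
        return ((1u << (bits & 31)) - 1) << (offset & 31);
    }

    static uint32_t bfi(uint32_t mask, uint32_t a, uint32_t b)
    {
        /* take bits from 'a' where mask is set, from 'b' elsewhere */
        return (mask & a) | (~mask & b);
    }

    /* TGSI_OPCODE_BFI(base, insert, offset, bits), i.e. bitfieldInsert() */
    static uint32_t tgsi_bfi(uint32_t base, uint32_t insert,
                             uint32_t offset, uint32_t bits)
    {
        return bfi(bfm(bits, offset), insert << (offset & 31), base);
    }
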
>> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
>> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
>> transformation:
>>     (X << S) & (Y << S) ----> (X & Y) << S
>> This breaks recognition of BFI and also of the second use of BFM.
>> Therefore this version calculates the same BFM expression twice. The
>> first BFM is recognized by pattern matching, but the second BFM as
>> well as BFI is unrecognizable due to the transformation. The result
>> is:
>>     s_lshl_b32 s1, 1, s0
>>     s_bfm_b32 s0, s0, s6
>>     s_add_i32 s1, s1, -1
>>     s_not_b32 s0, s0
>>     s_and_b32 s1, s1, s4
>>     s_and_b32 s0, s0, s5
>>     s_lshl_b32 s1, s1, s6
>>     s_or_b32 s0, s1, s0
>>
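
In rough C terms (variable names are illustrative), the expansion and the
rewrite that breaks it look like this:

    #include <stdint.h>

    /* The expansion computes the BFM expression twice so that at least
       one copy can still be matched: */
    static uint32_t expanded_bfi(uint32_t base, uint32_t insert,
                                 uint32_t offset, uint32_t bits)
    {
        uint32_t mask1 = ((1u << bits) - 1) << offset;   /* BFM #1 */
        uint32_t mask2 = ((1u << bits) - 1) << offset;   /* BFM #2 */
        return ((insert << offset) & mask2)              /* new bits */
             | (base & ~mask1);                          /* preserved bits */
    }

    /* LLVM always rewrites ((insert << offset) & mask2) as
       ((insert & ((1u << bits) - 1)) << offset), so BFM #2 disappears
       and with it the (m & a) | (~m & b) shape needed to match BFI;
       only BFM #1 is still matched. */
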
>> There are 2 ways out of this:
>>
>> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
>> the future.
>>
>> 2) Try to recognize the expression tree seen by the backend. Changes
>> in LLVM core can break it. More complicated shaders with more
>> opportunities for transformations can break it too:
>>
>> def : Pat <
>>   (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
>>             -1)), i32:$b)), i32:$c)),
>>            (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
>>             -1)), i32:$c)), -1)), i32:$d)))),
>>   (V_BFI_B32 (S_BFM_B32 $a, $c),
>>              (S_LSHL_B32 $b, $c),
>>              $d)
>> >;
>
> I don't want to waste a lot of time discussing this, because it probably
> doesn't matter too much in the long run.  I'm fine with using
> intrinsics, but I just wanted to point out a few things in case you or
> someone else wants to get this working using LLVM IR.
>
> 1. Running the instruction combining pass should help with pattern
> matching.  This transforms common sequences into canonical forms which
> make them easier to match in the backend.  We should be running this
> pass anyway as it has some good optimizations.
>
>
> diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> index dce5b55..45c9eb8 100644
> --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct radeon_llvm_context * ctx)
>         LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
>
>         /* Add some optimization passes */
> +       LLVMAddInstructionCombiningPass(gallivm->passmgr);
>         LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
>         LLVMAddLICMPass(gallivm->passmgr);
>         LLVMAddAggressiveDCEPass(gallivm->passmgr);

I tried this a long time ago and it broke a few tests which used the
kill intrinsic.

I'm testing it right now and it increases the shader binary size quite
a lot. It looks like some DAG combines don't work with it anymore and
the generated code looks worse overall.

I occasionally use llc for debugging, and it doesn't seem to run this
pass either, as far as I can tell. Anyway, it looks like a lot of work
would be needed just to fix the resulting code-generation regressions,
and once Mesa starts using the pass, llc should use it too.


>
>
>
>
> 2. We have a number of patterns for BFI already in AMDGPUInstructions.td.
> If we are modeling BFI using LLVM IR, we should be using one of those
> patterns.

All of them are useless for TGSI_OPCODE_BFI.


>
> 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
> pattern.  There should be one pattern for BFI, one for BFM, and one for
> LSHL.

This won't work for TGSI, as I tried to explain in detail above. Matching
the whole pattern is the only way: the big pattern can recognize the
duplicated expression hidden inside it (note that (add (shl 1, i32:$a), -1)
appears twice above), which CSE can't, and no strict subset of the pattern
makes any sense on its own.


>
> 4. Using intrinsics means no constant folding will be done, which could
> lead to worse code in some cases.

This is not a big issue. The GLSL compiler understands these
intrinsics and takes care of constant folding.
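
For example, with all-constant operands something like
bitfieldInsert(0x0, 0xF, 8, 4) is folded in the frontend:

    mask   = ((1 << 4) - 1) << 8                 = 0x00000F00
    result = ((0xF << 8) & mask) | (0x0 & ~mask) = 0x00000F00

so in the fully constant case there is nothing left for the backend to
fold anyway. (The values here are just an illustration.)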

Marek

