[Mesa-dev] [PATCH 7/9] radeonsi: add support for easy opcodes from ARB_gpu_shader5

Wed Mar 11 03:49:22 PDT 2015

On Wed, Mar 11, 2015 at 2:54 AM, Tom Stellard <tom at stellard.net> wrote:
> On Wed, Mar 11, 2015 at 01:27:32AM +0100, Marek Olšák wrote:
>> On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <tom at stellard.net> wrote:
>> > On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
>> >> I've looked into how to recognize BFM and BFI and discovered that if
>> >> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
>> >> pattern in the backend due to LLVM transformations. The reason it's
>> >> impossible is that one particular simplification of the expanded IR
>> >> can always be done and it always changes the IR in a way that BFI
>> >> can't be recognized anymore.
>> >>
>> >> The ideal transformation from TGSI to ISA is (note that this is also
>> >> how GLSL defines the opcode):
>> >>
>> >> TGSI_OPCODE_BFI(base, insert, offset, bits)
>> >> = BFI(BFM(bits, offset), SHL(insert, offset), base) =
>> >>     s_lshl_b32 s1, s4, s6
>> >>     s_bfm_b32 s0, s0, s6
>> >>     v_mov_b32_e32 v0, s5
>> >>     v_mov_b32_e32 v1, s1
>> >>     v_bfi_b32 v0, s0, v1, v0
>> >> Ideally 3 instructions if all sources are vector registers.
>> >>
>> >> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
>> >> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
>> >> transformation:
>> >>     (X << S) & (Y << S) ----> (X & Y) << S
>> >> Which breaks recognition of BFI and also the second use of BFM.
>> >> Therefore this version calculates the same BFM expression twice. The
>> >> first BFM is recognized by pattern matching, but the second BFM as
>> >> well as BFI is unrecognizable due to the transformation. The result
>> >> is:
>> >>     s_lshl_b32 s1, 1, s0
>> >>     s_bfm_b32 s0, s0, s6
>> >>     s_add_i32 s1, s1, -1
>> >>     s_not_b32 s0, s0
>> >>     s_and_b32 s1, s1, s4
>> >>     s_and_b32 s0, s0, s5
>> >>     s_lshl_b32 s1, s1, s6
>> >>     s_or_b32 s0, s1, s0
>> >>
>> >> There are 2 ways out of this:
>> >>
>> >> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
>> >> the future.
>> >>
>> >> 2) Try to recognize the expression tree seen by the backend. Changes
>> >> in LLVM core can break it. More complicated shaders with more
>> >> opportunities for transformations can break it too:
>> >>
>> >> def : Pat <
>> >>   (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
>> >>             -1)), i32:$b)), i32:$c)),
>> >>            (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
>> >>             -1)), i32:$c)), -1)), i32:$d)))),
>> >>   (V_BFI_B32 (S_BFM_B32 $a, $c),
>> >>              (S_LSHL_B32 $b, $c),
>> >>              $d)
>> >> >;
>> >
>> > I don't want to waste a lot of time discussing this, because it probably
>> > doesn't matter too much in the long run.  I'm fine with using
>> > intrinsics, but I just wanted to point out a few things in case you or
>> > someone else wants to get this working using LLVM IR.
>> >
>> > 1. Running the instruction combining pass should help with pattern
>> > matching.  This transforms common sequence into canonical forms which
>> > make them easier to match in the backend.  We should be running this
>> > pass anyway as it has some good optimization.
>> >
>> >
>> > diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
>> > index dce5b55..45c9eb8 100644
>> > --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
>> > +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
>> > @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct
>> > radeon_llvm_context * ctx)
>> >         LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
>> >
>> >         /* Add some optimization passes */
>> > +       LLVMAddInstructionCombiningPass(gallivm->passmgr);
>> >         LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
>> >         LLVMAddLICMPass(gallivm->passmgr);
>> >         LLVMAddAggressiveDCEPass(gallivm->passmgr);
>>
>> I tried this a long time ago and it broke a few tests which used the
>> kill intrinsic.
>>
>> I'm testing it right now and it increases the shader binary size quite
>> a lot. It looks like some DAG combines don't work with it anymore and
>> the generated code looks worse overall.
>>
>> I occasionally use llc for debugging, which doesn't seem to use it
>> either? Anyway, it looks like there's a lot of work needed just to fix
>> the worse code generation. And when Mesa starts to use it, llc should
>> use it too.
>>
>
> When mesa dumps the LLVM IR, all the passes we add in radeonsi have
> already been run, so we wouldn't need to add it to llc.
>
>>
>> >
>> >
>> >
>> >
>> > 2. We have a number of patterns for BFI already in AMDGPUInstructions.td
>> > If we are modeling BFI using LLVM IR, we should be using one of those
>> > patterns.
>>
>> All of them are useless for TGSI_OPCODE_BFI.
>>
>
> I guess I don't understand why.  If you implemented TGSI_OPCODE_BFI
> with the bfm intrinsic and an open-coded bfi.  Couldn't you match
> the open-coded bfi to one of the patterns in AMDGPUInstructions.td?
>
>>
>> >
>> > 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
>> > pattern.  There should be one pattern for BFI, one for BFM, and one for
>> > LSHL.
>>
>> This won't work for TGSI, which I tried to explain in detail. Matching
>> the whole pattern is the only way. The big pattern can recognize a
>> hidden duplicated expression in it which the CSE can't. Any strict
>> subset of the pattern doesn't make any sense alone.
>>
>
> Ok, perhaps I didn't really understand your example.  Would you
> be able to send the LLVM IR you used.

I'll show you where the problem is. This pattern received by the
backend can't be matched with a separate BFI pattern:

(i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
          -1)), i32:$b)), i32:$c)),
         (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
          -1)), i32:$c)), -1)), i32:$d))))

But it can be matched after expanding it using this distributive law:
(a & b) << c ---> (a << c) & (b << c)
This is an anti-optimization, but it's required here. It should
transform the first occurence of "shl .. and" into "and .. shl" on the
first line. After that, BFI is obvious and looks like this:

(i32 (or (i32 (and ...),
         (i32 (and (i32 (xor ..., -1)), ...))))

And even one BFM and one SHL are now well-formed there.

Summary:
- BFI can be matched using the input LLVM IR.
- LLVM applies this transformation during one of its passes:
(a << c) & (b << c) ---> (a & b) << c
- The backend can no longer trivially match BFI and produces horrible code.
- A big pattern is required to reverse the transformation.

Marek