[Mesa-dev] [PATCH 7/9] radeonsi: add support for easy opcodes from ARB_gpu_shader5
Tom Stellard
tom at stellard.net
Tue Mar 10 18:54:42 PDT 2015
On Wed, Mar 11, 2015 at 01:27:32AM +0100, Marek Olšák wrote:
> On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <tom at stellard.net> wrote:
> > On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
> >> I've looked into how to recognize BFM and BFI and discovered that if
> >> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
> >> pattern in the backend due to LLVM transformations. The reason it's
> >> impossible is that one particular simplification of the expanded IR
> >> can always be done and it always changes the IR in a way that BFI
> >> can't be recognized anymore.
> >>
> >> The ideal transformation from TGSI to ISA is (note that this is also
> >> how GLSL defines the opcode):
> >>
> >> TGSI_OPCODE_BFI(base, insert, offset, bits)
> >> = BFI(BFM(bits, offset), SHL(insert, offset), base) =
> >> s_lshl_b32 s1, s4, s6
> >> s_bfm_b32 s0, s0, s6
> >> v_mov_b32_e32 v0, s5
> >> v_mov_b32_e32 v1, s1
> >> v_bfi_b32 v0, s0, v1, v0
> >> Ideally 3 instructions if all sources are vector registers.
> >>
> >> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
> >> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
> >> transformation:
> >> (X << S) & (Y << S) ----> (X & Y) << S
> >> Which breaks recognition of BFI and also the second use of BFM.
> >> Therefore this version calculates the same BFM expression twice. The
> >> first BFM is recognized by pattern matching, but the second BFM as
> >> well as BFI is unrecognizable due to the transformation. The result
> >> is:
> >> s_lshl_b32 s1, 1, s0
> >> s_bfm_b32 s0, s0, s6
> >> s_add_i32 s1, s1, -1
> >> s_not_b32 s0, s0
> >> s_and_b32 s1, s1, s4
> >> s_and_b32 s0, s0, s5
> >> s_lshl_b32 s1, s1, s6
> >> s_or_b32 s0, s1, s0
> >>
> >> There are 2 ways out of this:
> >>
> >> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
> >> the future.
> >>
> >> 2) Try to recognize the expression tree seen by the backend. Changes
> >> in LLVM core can break it. More complicated shaders with more
> >> opportunities for transformations can break it too:
> >>
> >> def : Pat <
> >> (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
> >> -1)), i32:$b)), i32:$c)),
> >> (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
> >> -1)), i32:$c)), -1)), i32:$d)))),
> >> (V_BFI_B32 (S_BFM_B32 $a, $c),
> >> (S_LSHL_B32 $b, $c),
> >> $d)
> >> >;
> >
> > I don't want to waste a lot of time discussing this, because it probably
> > doesn't matter too much in the long run. I'm fine with using
> > intrinsics, but I just wanted to point out a few things in case you or
> > someone else wants to get this working using LLVM IR.
> >
> > 1. Running the instruction combining pass should help with pattern
> > matching. This transforms common sequence into canonical forms which
> > make them easier to match in the backend. We should be running this
> > pass anyway as it has some good optimizations.
> >
> >
> > diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > index dce5b55..45c9eb8 100644
> > --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct
> > radeon_llvm_context * ctx)
> > LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
> >
> > /* Add some optimization passes */
> > + LLVMAddInstructionCombiningPass(gallivm->passmgr);
> > LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
> > LLVMAddLICMPass(gallivm->passmgr);
> > LLVMAddAggressiveDCEPass(gallivm->passmgr);
>
> I tried this a long time ago and it broke a few tests which used the
> kill intrinsic.
>
> I'm testing it right now and it increases the shader binary size quite
> a lot. It looks like some DAG combines don't work with it anymore and
> the generated code looks worse overall.
>
> I occasionally use llc for debugging, which doesn't seem to use it
> either? Anyway, it looks like there's a lot of work needed just to fix
> the worse code generation. And when Mesa starts to use it, llc should
> use it too.
>
When Mesa dumps the LLVM IR, all the passes we add in radeonsi have
already been run, so we wouldn't need to add it to llc.
>
> >
> >
> >
> >
> > 2. We have a number of patterns for BFI already in AMDGPUInstructions.td
> > If we are modeling BFI using LLVM IR, we should be using one of those
> > patterns.
>
> All of them are useless for TGSI_OPCODE_BFI.
>
I guess I don't understand why. If you implemented TGSI_OPCODE_BFI
with the bfm intrinsic and an open-coded bfi, couldn't you match
the open-coded bfi to one of the patterns in AMDGPUInstructions.td?
>
> >
> > 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
> > pattern. There should be one pattern for BFI, one for BFM, and one for
> > LSHL.
>
> This won't work for TGSI, which I tried to explain in detail. Matching
> the whole pattern is the only way: the big pattern can recognize a
> hidden duplicated expression inside it that CSE can't, and no strict
> subset of the pattern makes sense on its own.
>
Ok, perhaps I didn't really understand your example. Would you
be able to send the LLVM IR you used?
>
> >
> > 4. Using intrinsics means no constant folding will be done, which could
> > lead to worse code in some cases.
>
> This is not a big issue. The GLSL compiler understands these
> intrinsics and takes care of constant folding.
>
I forgot about that. I was also thinking about the case where we use
an intrinsic internally in the driver (as I think you did in one of
the patches), but it seems like this would be even less of an issue.
-Tom
> Marek