[Mesa-dev] [PATCH 7/9] radeonsi: add support for easy opcodes from ARB_gpu_shader5
Tom Stellard
tom at stellard.net
Tue Mar 10 18:54:42 PDT 2015
On Wed, Mar 11, 2015 at 01:27:32AM +0100, Marek Olšák wrote:
> On Wed, Mar 11, 2015 at 12:09 AM, Tom Stellard <tom at stellard.net> wrote:
> > On Tue, Mar 10, 2015 at 11:01:21PM +0100, Marek Olšák wrote:
> >> I've looked into how to recognize BFM and BFI and discovered that if
> >> TGSI_OPCODE_BFI is expanded, it's _impossible_ to recognize the
> >> pattern in the backend due to LLVM transformations. The reason it's
> >> impossible is that one particular simplification of the expanded IR
> >> can always be done and it always changes the IR in a way that BFI
> >> can't be recognized anymore.
> >>
> >> The ideal transformation from TGSI to ISA is (note that this is also
> >> how GLSL defines the opcode):
> >>
> >> TGSI_OPCODE_BFI(base, insert, offset, bits)
> >> = BFI(BFM(bits, offset), SHL(insert, offset), base) =
> >> s_lshl_b32 s1, s4, s6
> >> s_bfm_b32 s0, s0, s6
> >> v_mov_b32_e32 v0, s5
> >> v_mov_b32_e32 v1, s1
> >> v_bfi_b32 v0, s0, v1, v0
> >> Ideally 3 instructions if all sources are vector registers.
> >>
> >> However, if TGSI_OPCODE_BFI is expanded into basic bitwise operations
> >> (BTW the result of BFM has 2 uses in BFI), LLVM applies this
> >> transformation:
> >> (X << S) & (Y << S) ----> (X & Y) << S
> >> Which breaks recognition of BFI and also the second use of BFM.
> >> Therefore this version calculates the same BFM expression twice. The
> >> first BFM is recognized by pattern matching, but the second BFM as
> >> well as BFI is unrecognizable due to the transformation. The result
> >> is:
> >> s_lshl_b32 s1, 1, s0
> >> s_bfm_b32 s0, s0, s6
> >> s_add_i32 s1, s1, -1
> >> s_not_b32 s0, s0
> >> s_and_b32 s1, s1, s4
> >> s_and_b32 s0, s0, s5
> >> s_lshl_b32 s1, s1, s6
> >> s_or_b32 s0, s1, s0
> >>
> >> There are 2 ways out of this:
> >>
> >> 1) Use BFM and BFI intrinsics in Mesa. Simple and unlikely to break in
> >> the future.
> >>
> >> 2) Try to recognize the expression tree seen by the backend. Changes
> >> in LLVM core can break it. More complicated shaders with more
> >> opportunities for transformations can break it too:
> >>
> >> def : Pat <
> >> (i32 (or (i32 (shl (i32 (and (i32 (add (i32 (shl 1, i32:$a)),
> >> -1)), i32:$b)), i32:$c)),
> >> (i32 (and (i32 (xor (i32 (shl (i32 (add (i32 (shl 1, i32:$a)),
> >> -1)), i32:$c)), -1)), i32:$d)))),
> >> (V_BFI_B32 (S_BFM_B32 $a, $c),
> >> (S_LSHL_B32 $b, $c),
> >> $d)
> >> >;
> >
> > I don't want to waste a lot of time discussing this, because it probably
> > doesn't matter too much in the long run. I'm fine with using
> > intrinsics, but I just wanted to point out a few things in case you or
> > someone else wants to get this working using LLVM IR.
> >
> > 1. Running the instruction combining pass should help with pattern
> > matching. This transforms common sequence into canonical forms which
> > make them easier to match in the backend. We should be running this
> > pass anyway as it has some good optimizations.
> >
> >
> > diff --git a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > index dce5b55..45c9eb8 100644
> > --- a/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > +++ b/src/gallium/drivers/radeon/radeon_setup_tgsi_llvm.c
> > @@ -1444,6 +1444,7 @@ void radeon_llvm_finalize_module(struct
> > radeon_llvm_context * ctx)
> > LLVMAddPromoteMemoryToRegisterPass(gallivm->passmgr);
> >
> > /* Add some optimization passes */
> > + LLVMAddInstructionCombiningPass(gallivm->passmgr);
> > LLVMAddScalarReplAggregatesPass(gallivm->passmgr);
> > LLVMAddLICMPass(gallivm->passmgr);
> > LLVMAddAggressiveDCEPass(gallivm->passmgr);
>
> I tried this a long time ago and it broke a few tests which used the
> kill intrinsic.
>
> I'm testing it right now and it increases the shader binary size quite
> a lot. It looks like some DAG combines don't work with it anymore and
> the generated code looks worse overall.
>
> I occasionally use llc for debugging, which doesn't seem to use it
> either? Anyway, it looks like there's a lot of work needed just to fix
> the worse code generation. And when Mesa starts to use it, llc should
> use it too.
>
When Mesa dumps the LLVM IR, all the passes we add in radeonsi have
already been run, so we wouldn't need to add it to llc.
>
> >
> >
> >
> >
> > 2. We have a number of patterns for BFI already in AMDGPUInstructions.td
> > If we are modeling BFI using LLVM IR, we should be using one of those
> > patterns.
>
> All of them are useless for TGSI_OPCODE_BFI.
>
I guess I don't understand why. If you implemented TGSI_OPCODE_BFI
with the bfm intrinsic and an open-coded bfi, couldn't you match
the open-coded bfi to one of the patterns in AMDGPUInstructions.td?
>
> >
> > 3. The BFI, BFM, LSHL sequence does not need to be matched in a single
> > pattern. There should be one pattern for BFI, one for BFM, and one for
> > LSHL.
>
> This won't work for TGSI, which I tried to explain in detail. Matching
> the whole pattern is the only way: the big pattern can recognize a
> hidden duplicated expression inside it that CSE can't, and no strict
> subset of the pattern makes sense on its own.
>
Ok, perhaps I didn't really understand your example. Would you
be able to send the LLVM IR you used?
>
> >
> > 4. Using intrinsics means no constant folding will be done, which could
> > lead to worse code in some cases.
>
> This is not a big issue. The GLSL compiler understands these
> intrinsics and takes care of constant folding.
>
I forgot about that. I was also thinking about the case where we use
an intrinsic internally in the driver (as I think you did in one of
the patches), but it seems like this would be even less of an issue.
-Tom
> Marek