[Mesa-dev] [PATCH] nv50/ir: optimize ADD(SHL(a, b), c) to SHLADD(a, b, c)
Karol Herbst
karolherbst at gmail.com
Sun Oct 9 19:28:14 UTC 2016
2016-10-09 13:58 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>
>
> On 10/08/2016 10:04 PM, Karol Herbst wrote:
>>
>> looks great, a few comments below
>
>
> Thanks!
>
>>
>> 2016-10-08 21:55 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>>>
>>> total instructions in shared programs :2286901 -> 2284473 (-0.11%)
>>> total gprs used in shared programs :335256 -> 335273 (0.01%)
>>> total local used in shared programs :31968 -> 31968 (0.00%)
>>>
>>> local gpr inst bytes
>>> helped 0 41 852 852
>>> hurt 0 44 23 23
>>>
>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>> ---
>>> .../drivers/nouveau/codegen/nv50_ir_peephole.cpp | 94
>>> ++++++++++++++++++++++
>>> 1 file changed, 94 insertions(+)
>>>
>>> diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>> b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>> index 6efb29e..caf5d1d 100644
>>> --- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>> +++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>> @@ -2132,6 +2132,99 @@ AlgebraicOpt::visit(BasicBlock *bb)
>>>
>>> //
>>> =============================================================================
>>>
>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>> +class LateAlgebraicOpt : public Pass
>>> +{
>>> +private:
>>> + virtual bool visit(BasicBlock *);
>>> +
>>> + void handleADD(Instruction *);
>>> + bool tryADDToSHLADD(Instruction *);
>>> +
>>> + BuildUtil bld;
>>> +};
>>> +
>>> +void
>>> +LateAlgebraicOpt::handleADD(Instruction *add)
>>> +{
>>> + Value *src0 = add->getSrc(0);
>>> + Value *src1 = add->getSrc(1);
>>> +
>>> + if (src0->reg.file != FILE_GPR || src1->reg.file != FILE_GPR)
>>> + return;
>>> +
>>> + if (prog->getTarget()->isOpSupported(OP_SHLADD, add->dType))
>>> + tryADDToSHLADD(add);
>>> +}
>>> +
>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>> +bool
>>> +LateAlgebraicOpt::tryADDToSHLADD(Instruction *add)
>>> +{
>>> + Value *src0 = add->getSrc(0);
>>> + Value *src1 = add->getSrc(1);
>>> + Modifier mod[2];
>>> + Value *src;
>>> + int s;
>>> +
>>> + if (add->saturate || add->usesFlags() || typeSizeof(add->dType) == 8)
>>> + return false;
>>> +
>>> + if (src0->getUniqueInsn() && src0->getUniqueInsn()->op == OP_SHL)
>>> + s = 0;
>>> + else
>>> + if (src1->getUniqueInsn() && src1->getUniqueInsn()->op == OP_SHL)
>>> + s = 1;
>>> + else
>>> + return false;
>>> +
>>> + src = add->getSrc(s);
>>> +
>>> + if (src->getUniqueInsn()->bb != add->bb)
>>> + return false;
>>
>>
>> is there a reason to check for the bb here?
>
>
> Those optimizations like ADD->MAD, ADD->SAD, etc, have to be local (ie.
> inside the same basic block). I just apply the same rule here.
>
>>
>>> +
>>> + if (src->getInsn()->usesFlags() || src->getInsn()->subOp)
>>> + return false;
>>> +
>>> + if (src->getInsn()->src(1).getFile() != FILE_IMMEDIATE)
>>> + return false;
>>> +
>>> + mod[0] = add->src(0).mod;
>>> + mod[1] = add->src(1).mod;
>>> +
>>> + add->op = OP_SHLADD;
>>> + add->setSrc(2, add->src(s ? 0 : 1));
>>
>>
>> I usually use "s ^ 1" instead.
>
>
> Yeah, but 's ? 0 : 1' seems to be more used than 's ^ 1' (in peephole).
>
>>
>>> + add->src(2).mod = mod[s];
>>> +
>>> + add->setSrc(0, src->getInsn()->getSrc(0));
>>> + add->src(0).mod = mod[0];
>>> + add->setSrc(1, src->getInsn()->getSrc(1));
>>> + add->src(1).mod = Modifier(0);
>>> +
>>> + return true;
>>> +}
>>> +
>>> +bool
>>> +LateAlgebraicOpt::visit(BasicBlock *bb)
>>> +{
>>> + Instruction *next;
>>> +
>>> + for (Instruction *i = bb->getEntry(); i; i = next) {
>>> + next = i->next;
>>> + switch (i->op) {
>>> + case OP_ADD:
>>> + handleADD(i);
>>> + break;
>>> + default:
>>> + break;
>>> + }
>>> + }
>>
>>
>> you can also just implement visit(Instruction *) and return true; then
>> you won't need that silly loop.
>
>
> True.
>
>>
>>> +
>>> + return true;
>>> +}
>>> +
>>> +//
>>> =============================================================================
>>> +
>>> static inline void
>>> updateLdStOffset(Instruction *ldst, int32_t offset, Function *fn)
>>> {
>>> @@ -3436,6 +3529,7 @@ Program::optimizeSSA(int level)
>>> RUN_PASS(2, AlgebraicOpt, run);
>>> RUN_PASS(2, ModifierFolding, run); // before load propagation -> less
>>> checks
>>> RUN_PASS(1, ConstantFolding, foldAll);
>>> + RUN_PASS(2, LateAlgebraicOpt, run);
>>
>>
>> What is the reason to add a new AlgebraicOpt pass for this?
>
>
> Because a bunch of optimizations are applied in ConstantFolding which means
> this one has to be done *after*. My first attempt was to do it in
> AlgebraicOpt, but it's actually not the best place. :-)
>
Yeah but this applies to many other opts as well. I think in the end
we need to run those passes inside a loop and iterate like 2 or 3
times like I do here:
https://github.com/karolherbst/mesa/commit/e2ae247aa27ddb150ff4ccd1c0f19baa470134ef
The first line is cut of, but the benefit is around 0.5% less
instructions and a decent speed up in pixmark_piano.
>>
>>> RUN_PASS(1, LoadPropagation, run);
>>> RUN_PASS(1, IndirectPropagation, run);
>>> RUN_PASS(2, MemoryOpt, run);
>>> --
>>> 2.10.0
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list