[Mesa-dev] [PATCH] nv50/ir: optimize ADD(SHL(a, b), c) to SHLADD(a, b, c)
Ilia Mirkin
imirkin at alum.mit.edu
Sun Oct 9 19:34:54 UTC 2016
On Sun, Oct 9, 2016 at 3:28 PM, Karol Herbst <karolherbst at gmail.com> wrote:
> 2016-10-09 13:58 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>>
>>
>> On 10/08/2016 10:04 PM, Karol Herbst wrote:
>>>
>>> looks great, a few comments below
>>
>>
>> Thanks!
>>
>>>
>>> 2016-10-08 21:55 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>>>>
>>>> total instructions in shared programs : 2286901 -> 2284473 (-0.11%)
>>>> total gprs used in shared programs : 335256 -> 335273 (0.01%)
>>>> total local used in shared programs : 31968 -> 31968 (0.00%)
>>>>
>>>>            local   gpr   inst  bytes
>>>> helped         0    41    852    852
>>>> hurt           0    44     23     23
>>>>
>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>> ---
>>>> .../drivers/nouveau/codegen/nv50_ir_peephole.cpp | 94 ++++++++++++++++++++++
>>>> 1 file changed, 94 insertions(+)
>>>>
>>>> diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>> index 6efb29e..caf5d1d 100644
>>>> --- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>> +++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>> @@ -2132,6 +2132,99 @@ AlgebraicOpt::visit(BasicBlock *bb)
>>>>
>>>> // =============================================================================
>>>>
>>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>>> +class LateAlgebraicOpt : public Pass
>>>> +{
>>>> +private:
>>>> + virtual bool visit(BasicBlock *);
>>>> +
>>>> + void handleADD(Instruction *);
>>>> + bool tryADDToSHLADD(Instruction *);
>>>> +
>>>> + BuildUtil bld;
>>>> +};
>>>> +
>>>> +void
>>>> +LateAlgebraicOpt::handleADD(Instruction *add)
>>>> +{
>>>> + Value *src0 = add->getSrc(0);
>>>> + Value *src1 = add->getSrc(1);
>>>> +
>>>> + if (src0->reg.file != FILE_GPR || src1->reg.file != FILE_GPR)
>>>> + return;
>>>> +
>>>> + if (prog->getTarget()->isOpSupported(OP_SHLADD, add->dType))
>>>> + tryADDToSHLADD(add);
>>>> +}
>>>> +
>>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>>> +bool
>>>> +LateAlgebraicOpt::tryADDToSHLADD(Instruction *add)
>>>> +{
>>>> + Value *src0 = add->getSrc(0);
>>>> + Value *src1 = add->getSrc(1);
>>>> + Modifier mod[2];
>>>> + Value *src;
>>>> + int s;
>>>> +
>>>> + if (add->saturate || add->usesFlags() || typeSizeof(add->dType) == 8)
>>>> + return false;
>>>> +
>>>> + if (src0->getUniqueInsn() && src0->getUniqueInsn()->op == OP_SHL)
>>>> + s = 0;
>>>> + else
>>>> + if (src1->getUniqueInsn() && src1->getUniqueInsn()->op == OP_SHL)
>>>> + s = 1;
>>>> + else
>>>> + return false;
>>>> +
>>>> + src = add->getSrc(s);
>>>> +
>>>> + if (src->getUniqueInsn()->bb != add->bb)
>>>> + return false;
>>>
>>>
>>> is there a reason to check for the bb here?
>>
>>
>> Optimizations like ADD->MAD, ADD->SAD, etc., have to be local (i.e.
>> inside the same basic block). I just apply the same rule here.
>>
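For reference, the rewrite under discussion fuses the shift and the add
into a single instruction. A plain C++ illustration of the semantics
(not the compiler IR):

    #include <cstdint>

    // ADD(SHL(a, b), c) -> SHLADD(a, b, c): one instruction instead of two
    static uint32_t shladd(uint32_t a, uint32_t b, uint32_t c)
    {
       return (a << b) + c;
    }
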
>>>
>>>> +
>>>> + if (src->getInsn()->usesFlags() || src->getInsn()->subOp)
>>>> + return false;
>>>> +
>>>> + if (src->getInsn()->src(1).getFile() != FILE_IMMEDIATE)
>>>> + return false;
>>>> +
>>>> + mod[0] = add->src(0).mod;
>>>> + mod[1] = add->src(1).mod;
>>>> +
>>>> + add->op = OP_SHLADD;
>>>> + add->setSrc(2, add->src(s ? 0 : 1));
>>>
>>>
>>> I usually use "s ^ 1" instead.
>>
>>
>> Yeah, but 's ? 0 : 1' seems to be more commonly used than 's ^ 1' (in peephole).
>>
>>>
>>>> + add->src(2).mod = mod[s];
>>>> +
>>>> + add->setSrc(0, src->getInsn()->getSrc(0));
>>>> + add->src(0).mod = mod[0];
>>>> + add->setSrc(1, src->getInsn()->getSrc(1));
>>>> + add->src(1).mod = Modifier(0);
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> +bool
>>>> +LateAlgebraicOpt::visit(BasicBlock *bb)
>>>> +{
>>>> + Instruction *next;
>>>> +
>>>> + for (Instruction *i = bb->getEntry(); i; i = next) {
>>>> + next = i->next;
>>>> + switch (i->op) {
>>>> + case OP_ADD:
>>>> + handleADD(i);
>>>> + break;
>>>> + default:
>>>> + break;
>>>> + }
>>>> + }
>>>
>>>
>>> you can also just implement visit(Instruction *) and return true; then
>>> you won't need that silly loop.
>>
>>
>> True.
>>
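A minimal sketch of that suggestion, assuming the Pass base class also
dispatches through a per-instruction visit() override (as the reviewer
implies):

    // Called once per instruction; no manual iteration over the BB needed.
    bool
    LateAlgebraicOpt::visit(Instruction *i)
    {
       if (i->op == OP_ADD)
          handleADD(i);
       return true; // keep visiting the remaining instructions
    }
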
>>>
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> +// =============================================================================
>>>> +
>>>> static inline void
>>>> updateLdStOffset(Instruction *ldst, int32_t offset, Function *fn)
>>>> {
>>>> @@ -3436,6 +3529,7 @@ Program::optimizeSSA(int level)
>>>> RUN_PASS(2, AlgebraicOpt, run);
>>>> RUN_PASS(2, ModifierFolding, run); // before load propagation -> less checks
>>>> RUN_PASS(1, ConstantFolding, foldAll);
>>>> + RUN_PASS(2, LateAlgebraicOpt, run);
>>>
>>>
>>> What is the reason to add a new AlgebraicOpt pass for this?
>>
>>
>> Because a bunch of optimizations are applied in ConstantFolding, which
>> means this one has to be done *after*. My first attempt was to do it in
>> AlgebraicOpt, but that's actually not the best place. :-)
>>
>
> Yeah, but this applies to many other opts as well. I think in the end
> we need to run those passes inside a loop and iterate 2 or 3 times,
> like I do here:
> https://github.com/karolherbst/mesa/commit/e2ae247aa27ddb150ff4ccd1c0f19baa470134ef
>
> The first line is cut off, but the benefit is around 0.5% fewer
> instructions and a decent speed-up in pixmark_piano.
This is different. The loop helps achieve a fixed point. However, this
pass does not enable any further optimizations to take place, so it can
easily be placed outside such a fixed-point loop. The general reason we
don't have a loop at all is that the optimizations are carefully ordered
such that a loop would provide only minimal improvement.
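
A rough sketch of the distinction (hypothetical driver code; the loop
bound and the runFooOpt() helpers are illustrative, not the actual mesa
pass manager; each helper is assumed to return whether it changed
anything):

    // Passes that feed each other iterate toward a fixed point...
    bool progress = true;
    for (int iter = 0; progress && iter < 3; ++iter) {
       progress = false;
       progress |= runAlgebraicOpt();    // may expose new folding work
       progress |= runConstantFolding(); // may expose new algebraic work
    }
    // ...but SHLADD formation enables nothing further, so running it
    // once, after the loop, is enough.
    runLateAlgebraicOpt();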
-ilia