[Mesa-dev] [PATCH] nv50/ir: optimize ADD(SHL(a, b), c) to SHLADD(a, b, c)

Karol Herbst karolherbst at gmail.com
Sun Oct 9 19:44:57 UTC 2016


2016-10-09 21:34 GMT+02:00 Ilia Mirkin <imirkin at alum.mit.edu>:
> On Sun, Oct 9, 2016 at 3:28 PM, Karol Herbst <karolherbst at gmail.com> wrote:
>> 2016-10-09 13:58 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>>>
>>>
>>> On 10/08/2016 10:04 PM, Karol Herbst wrote:
>>>>
>>>> looks great, a few comments below
>>>
>>>
>>> Thanks!
>>>
>>>>
>>>> 2016-10-08 21:55 GMT+02:00 Samuel Pitoiset <samuel.pitoiset at gmail.com>:
>>>>>
>>>>> total instructions in shared programs : 2286901 -> 2284473 (-0.11%)
>>>>> total gprs used in shared programs    : 335256 -> 335273 (0.01%)
>>>>> total local used in shared programs   : 31968 -> 31968 (0.00%)
>>>>>
>>>>>                 local        gpr       inst      bytes
>>>>>     helped           0          41         852         852
>>>>>       hurt           0          44          23          23
>>>>>
>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>> ---
>>>>>  .../drivers/nouveau/codegen/nv50_ir_peephole.cpp | 94 ++++++++++++++++++++++
>>>>>  1 file changed, 94 insertions(+)
>>>>>
>>>>> diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>>> index 6efb29e..caf5d1d 100644
>>>>> --- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>>> +++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>>>>> @@ -2132,6 +2132,99 @@ AlgebraicOpt::visit(BasicBlock *bb)
>>>>>
>>>>>  // =============================================================================
>>>>>
>>>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>>>> +class LateAlgebraicOpt : public Pass
>>>>> +{
>>>>> +private:
>>>>> +   virtual bool visit(BasicBlock *);
>>>>> +
>>>>> +   void handleADD(Instruction *);
>>>>> +   bool tryADDToSHLADD(Instruction *);
>>>>> +
>>>>> +   BuildUtil bld;
>>>>> +};
>>>>> +
>>>>> +void
>>>>> +LateAlgebraicOpt::handleADD(Instruction *add)
>>>>> +{
>>>>> +   Value *src0 = add->getSrc(0);
>>>>> +   Value *src1 = add->getSrc(1);
>>>>> +
>>>>> +   if (src0->reg.file != FILE_GPR || src1->reg.file != FILE_GPR)
>>>>> +      return;
>>>>> +
>>>>> +   if (prog->getTarget()->isOpSupported(OP_SHLADD, add->dType))
>>>>> +      tryADDToSHLADD(add);
>>>>> +}
>>>>> +
>>>>> +// ADD(SHL(a, b), c) -> SHLADD(a, b, c)
>>>>> +bool
>>>>> +LateAlgebraicOpt::tryADDToSHLADD(Instruction *add)
>>>>> +{
>>>>> +   Value *src0 = add->getSrc(0);
>>>>> +   Value *src1 = add->getSrc(1);
>>>>> +   Modifier mod[2];
>>>>> +   Value *src;
>>>>> +   int s;
>>>>> +
>>>>> +   if (add->saturate || add->usesFlags() || typeSizeof(add->dType) == 8)
>>>>> +      return false;
>>>>> +
>>>>> +   if (src0->getUniqueInsn() && src0->getUniqueInsn()->op == OP_SHL)
>>>>> +      s = 0;
>>>>> +   else
>>>>> +   if (src1->getUniqueInsn() && src1->getUniqueInsn()->op == OP_SHL)
>>>>> +      s = 1;
>>>>> +   else
>>>>> +      return false;
>>>>> +
>>>>> +   src = add->getSrc(s);
>>>>> +
>>>>> +   if (src->getUniqueInsn()->bb != add->bb)
>>>>> +      return false;
>>>>
>>>>
>>>> is there a reason to check for the bb here?
>>>
>>>
>>> Those optimizations, like ADD->MAD, ADD->SAD, etc., have to be local
>>> (i.e. inside the same basic block). I just apply the same rule here.
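>>>
>>> For illustration, the pattern being fused is just this, in source form
>>> (hypothetical variables):
>>>
>>>    uint32_t t = a << 4;  // SHL: must live in the same BB as the ADD
>>>    uint32_t d = t + c;   // ADD: the pair becomes one SHLADD(a, 4, c)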
>>>
>>>>
>>>>> +
>>>>> +   if (src->getInsn()->usesFlags() || src->getInsn()->subOp)
>>>>> +      return false;
>>>>> +
>>>>> +   if (src->getInsn()->src(1).getFile() != FILE_IMMEDIATE)
>>>>> +      return false;
>>>>> +
>>>>> +   mod[0] = add->src(0).mod;
>>>>> +   mod[1] = add->src(1).mod;
>>>>> +
>>>>> +   add->op = OP_SHLADD;
>>>>> +   add->setSrc(2, add->src(s ? 0 : 1));
>>>>
>>>>
>>>> I usually use "s ^ 1" instead.
>>>
>>>
>>> Yeah, but 's ? 0 : 1' seems to be used more often than 's ^ 1' (in peephole).
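>>>
>>> For the record, the three spellings are equivalent here:
>>>
>>>    // for s in {0, 1}: (s ? 0 : 1) == (s ^ 1) == (1 - s)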
>>>
>>>>
>>>>> +   add->src(2).mod = mod[s];
>>>>> +
>>>>> +   add->setSrc(0, src->getInsn()->getSrc(0));
>>>>> +   add->src(0).mod = mod[0];
>>>>> +   add->setSrc(1, src->getInsn()->getSrc(1));
>>>>> +   add->src(1).mod = Modifier(0);
>>>>> +
>>>>> +   return true;
>>>>> +}
>>>>> +
>>>>> +bool
>>>>> +LateAlgebraicOpt::visit(BasicBlock *bb)
>>>>> +{
>>>>> +   Instruction *next;
>>>>> +
>>>>> +   for (Instruction *i = bb->getEntry(); i; i = next) {
>>>>> +      next = i->next;
>>>>> +      switch (i->op) {
>>>>> +      case OP_ADD:
>>>>> +         handleADD(i);
>>>>> +         break;
>>>>> +      default:
>>>>> +         break;
>>>>> +      }
>>>>> +   }
>>>>
>>>>
>>>> you can also just implement visit(Instruction *) and return true; then
>>>> you won't need that silly loop.
>>>
>>>
>>> True.
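>>>
>>> Untested sketch of that variant:
>>>
>>>    bool
>>>    LateAlgebraicOpt::visit(Instruction *i)
>>>    {
>>>       if (i->op == OP_ADD)
>>>          handleADD(i);
>>>       return true; // keep visiting the rest of the instructions
>>>    }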
>>>
>>>>
>>>>> +
>>>>> +   return true;
>>>>> +}
>>>>> +
>>>>> +// =============================================================================
>>>>> +
>>>>>  static inline void
>>>>>  updateLdStOffset(Instruction *ldst, int32_t offset, Function *fn)
>>>>>  {
>>>>> @@ -3436,6 +3529,7 @@ Program::optimizeSSA(int level)
>>>>>     RUN_PASS(2, AlgebraicOpt, run);
>>>>>     RUN_PASS(2, ModifierFolding, run); // before load propagation -> less checks
>>>>>     RUN_PASS(1, ConstantFolding, foldAll);
>>>>> +   RUN_PASS(2, LateAlgebraicOpt, run);
>>>>
>>>>
>>>> What is the reason to add a new AlgebraicOpt pass for this?
>>>
>>>
>>> Because a bunch of optimizations are applied in ConstantFolding, which
>>> means this one has to be done *after* it. My first attempt was to do it
>>> in AlgebraicOpt, but that's actually not the best place. :-)
>>>
>>
>> Yeah, but this applies to many other opts as well. I think in the end
>> we need to run those passes inside a loop and iterate two or three
>> times, like I do here:
>> https://github.com/karolherbst/mesa/commit/e2ae247aa27ddb150ff4ccd1c0f19baa470134ef
>>
>> The first line is cut off, but the benefit is around 0.5% fewer
>> instructions and a decent speed-up in pixmark_piano.
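>>
>> The idea in sketch form (hypothetical driver; the passes would have to
>> report whether they actually changed the IR):
>>
>>    bool progress;
>>    do {
>>       progress = false;
>>       // hypothetical helpers: each returns true if it rewrote the IR
>>       progress |= runAlgebraicOpt(prog);
>>       progress |= runConstantFolding(prog);
>>    } while (progress);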
>
> This is different. The loop helps achieve a fixed point. However, this
> pass does not enable any further optimizations to take place, so it can
> easily be placed outside such a fixed-point loop. The general reason we
> don't have a loop at all is that the optimizations are carefully ordered
> such that a loop would only provide a minimal improvement.
>
>   -ilia

Right, there is still a pending fix for that on my side:
https://github.com/karolherbst/mesa/commit/2cc9062e886175246d0f93041ea8e953928ba069

But there are other issues as well.

