[Mesa-dev] [PATCH v3 4/4] nv50/ir: further optimize multiplication by immediates
Karol Herbst
kherbst at redhat.com
Sat Aug 11 19:59:59 UTC 2018
yeah, I was mainly commenting on the questionable performance gains. We
can't just assume fewer instructions == more perf, as we don't really
know what changing the instructions actually does.
And right, I wasn't really taking LoadPropagation into account, but it
seems that at least nvidia prefers XMAD over anything else, except when
it can do it with a single shl(add).
I think only real benchmarks can tell us which approach is better. I am
mainly curious how fast XMAD really is.
On Sat, Aug 11, 2018 at 4:40 PM, Rhys Perry <pendingchaos02 at gmail.com> wrote:
> It seems multiplications by negative powers of two are nonexistent in the
> shader-db, so a specialized optimization for them would probably not be
> worth it.
>
> It seems my approach gives better instruction counts in shader-db than
> yours, since it can generate shorter code (for things like a * 7; see
> the worked example below) that is also more
> LoadPropagation/ModifierPropagation friendly.
>
> Yours gives slightly better gpr counts overall, though a decent number
> of shaders do much better gpr-wise with my approach. Yours also ends up
> slightly worse gpr-wise after bringing its instruction count closer to
> mine.
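>
> For example (hand-worked, not patch output), a * 7 lowers with the
> subtractive method to a single instruction,
>
>    b = shladd(a, 3, neg a)    // (a << 3) - a = a * 7
>
> while plain shift-and-add needs two:
>
>    t = shladd(a, 1, a)        // (a << 1) + a = a * 3
>    b = shladd(a, 2, t)        // (a << 2) + t = a * 7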
>
> Comparing my approach to yours:
> total instructions in shared programs : 5802201 -> 5818813 (0.29%)
> total gprs used in shared programs : 669876 -> 669708 (-0.03%)
> total shared used in shared programs : 548832 -> 548832 (0.00%)
> total local used in shared programs : 21068 -> 21068 (0.00%)
>
>            local  shared     gpr    inst   bytes
> helped         0       0     322     105     105
> hurt           0       0     133    2711    2711
>
> Comparing my approach to yours after making yours closer to mine by
> adding a few more specializations:
> total instructions in shared programs : 5802201 -> 5807994 (0.10%)
> total gprs used in shared programs : 669876 -> 669987 (0.02%)
> total shared used in shared programs : 548832 -> 548832 (0.00%)
> total local used in shared programs : 21068 -> 21068 (0.00%)
>
>            local  shared     gpr    inst   bytes
> helped         0       0     142     416     416
> hurt           0       0     109     489     489
> On Sat, Aug 11, 2018 at 11:37 AM Karol Herbst <kherbst at redhat.com> wrote:
>>
>> I think we could do something else (which may even cover more cases):
>>
>> 1. try to use a shl (we already do that)
>>
>> 2. use shladd for all negative power-of-two immediates (are we
>> already doing that? I think we miss a lot of opts where "worse"
>> instructions could include modifiers and we basically save neg/abs
>> instructions. OP_SHL can't take a neg, so we have to use OP_SHLADD
>> for that):
>> b = shladd(neg a, log2(imm), 0x0)
>>
>> 3. for all immediates in [0x0, 0xffff]:
>> t = xmad(a, imm, 0x0)
>> b = xmad.PSL(a.hi, imm, t)
>>
>> which should already be quite good.
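>>
>> To convince myself the two-XMAD sequence is exact, here is a host-side
>> C sketch (this encodes my reading of the XMAD semantics; treat the
>> exact operand selection as an assumption):
>>
>> #include <stdint.h>
>>
>> /* xmad(x, y, z)     = lo16(x) * lo16(y) + z
>>  * xmad.PSL(x, y, z) = ((lo16(x) * lo16(y)) << 16) + z
>>  * "a.hi" selects the high 16 bits of a. */
>> static uint32_t mul_imm16(uint32_t a, uint32_t imm /* in [0x0, 0xffff] */)
>> {
>>    uint32_t t = (a & 0xffff) * imm;        /* t = xmad(a, imm, 0x0) */
>>    return (((a >> 16) * imm) << 16) + t;   /* b = xmad.PSL(a.hi, imm, t) */
>> }
>>
>> This is exact mod 2^32 because
>> a * imm = lo16(a) * imm + ((hi16(a) * imm) << 16).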
>>
>> I don't know if using shifts or shladd is faster than xmad, so
>> without benchmarks I wouldn't want to include more complex
>> optimizations if we don't know they pay off. Nvidia doesn't seem to
>> do that either, but they use shladd in the negative immediate case.
>> Maybe they don't do it for the trivial shladd(a, log2(imm), a) case
>> for power of two + 1 because they simply don't care. Don't know. I am
>> sure it would be worth seeing where your opts actually make a
>> difference when added after the three above. Or maybe yours is also
>> faster than the third one above. I don't know.
>>
>> On Mon, Jul 23, 2018 at 12:40 PM, Rhys Perry <pendingchaos02 at gmail.com> wrote:
>> > Strongly mitigates the harm from the previous commit, which made many
>> > integer multiplications much heavier on the register and instruction
>> > count.
>> >
>> > total instructions in shared programs : 5839715 -> 5801926 (-0.65%)
>> > total gprs used in shared programs : 670553 -> 669853 (-0.10%)
>> > total shared used in shared programs : 548832 -> 548832 (0.00%)
>> > total local used in shared programs : 21164 -> 21068 (-0.45%)
>> >
>> >            local  shared     gpr    inst   bytes
>> > helped         1       0     408    2522    2522
>> > hurt           0       0     232      23      23
>> >
>> > Signed-off-by: Rhys Perry <pendingchaos02 at gmail.com>
>> > ---
>> > .../drivers/nouveau/codegen/nv50_ir_peephole.cpp | 126 ++++++++++++++++++---
>> > 1 file changed, 112 insertions(+), 14 deletions(-)
>> >
>> > diff --git a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>> > index a6ddb284b8..e5d033c9b0 100644
>> > --- a/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>> > +++ b/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp
>> > @@ -379,6 +379,10 @@ private:
>> >
>> > CmpInstruction *findOriginForTestWithZero(Value *);
>> >
>> > + Value *createMulMethod1(Value *a, unsigned b, Value *c);
>> > + Value *createMulMethod2(Value *a, unsigned b, Value *c);
>> > + Value *createMul(Value *a, unsigned b, Value *c);
>> > +
>> > unsigned int foldCount;
>> >
>> > BuildUtil bld;
>> > @@ -953,6 +957,88 @@ ConstantFolding::opnd3(Instruction *i, ImmediateValue &imm2)
>> > }
>> > }
>> >
>> > +Value *
>> > +ConstantFolding::createMulMethod1(Value *a, unsigned b, Value *c)
>> > +{
>> > + // basically constant-folded shift-and-add multiplication.
>> > + Value *res = c ? c : bld.loadImm(NULL, 0u);
>> > + bool resZero = !c;
>> > + unsigned ashift = 0;
>> > + while (b) {
>> > + if ((b & 1) && ashift) {
>> > + Value *sh = bld.loadImm(NULL, ashift);
>> > + if (resZero)
>> > + res = bld.mkOp2v(OP_SHL, TYPE_U32, bld.getSSA(), a, sh);
>> > + else
>> > + res = bld.mkOp3v(OP_SHLADD, TYPE_U32, bld.getSSA(), a, sh, res);
>> > + resZero = false;
>> > + } else if (b & 1) {
>> > + if (resZero)
>> > + res = a;
>> > + else
>> > + res = bld.mkOp2v(OP_ADD, TYPE_U32, bld.getSSA(), res, a);
>> > + resZero = false;
>> > + }
>> > + b >>= 1;
>> > + ashift++;
>> > + }
>> > + return res;
>> > +}
>> > +
>> > +Value *
>> > +ConstantFolding::createMulMethod2(Value *a, unsigned b, Value *c)
>> > +{
>> > + // basically does a * b2 - a * (b2 - b) + c
>> > + uint64_t b2 = util_next_power_of_two64(b);
>> > + unsigned b2shift = ffsll(b2) - 1;
>> > +
>> > + Value *mul1 = createMulMethod1(a, b2 - b, NULL);
>> > +
>> > + Value *res;
>> > + if (b2shift < 32) {
>> > + Instruction *i = bld.mkOp3(OP_SHLADD, TYPE_U32, bld.getSSA(),
>> > + a, bld.loadImm(NULL, b2shift), mul1);
>> > + res = i->getDef(0);
>> > +
>> > + // all targets supporting OP_SHLADD should pass this
>> > + assert(bld.getProgram()->getTarget()->isModSupported(i, 2, NV50_IR_MOD_NEG));
>> > + i->src(2).mod *= Modifier(NV50_IR_MOD_NEG);
>> > + } else {
>> > + res = bld.mkOp1v(OP_NEG, TYPE_U32, bld.getSSA(), mul1);
>> > + }
>> > +
>> > + if (c)
>> > + res = bld.mkOp2v(OP_ADD, TYPE_U32, bld.getSSA(), res, c);
>> > +
>> > + return res;
>> > +}
>> > +
>> > +Value *
>> > +ConstantFolding::createMul(Value *a, unsigned b, Value *c)
>> > +{
>> > + unsigned cost[2];
>> > +
>> > + // estimate cost for first method (a << i) + (a << j) + ...
>> > + cost[0] = util_bitcount64(b >> 1);
>> > +
>> > + // estimate cost for second method (a << i) - ((a << j) + (a << k) + ...)
>> > + uint64_t rounded_b = util_next_power_of_two64(b);
>> > + cost[1] = rounded_b == b ? 1 : (util_bitcount64((rounded_b - b) >> 1) + 1);
>> > + if (c) cost[1]++;
>> > +
>> > + // The general method, multiplication by XMADs, costs three instructions,
>> > + // so anything much costlier than that risks making things worse.
>> > + if (cost[0] > 3 && cost[1] > 3)
>> > + return NULL;
>> > +
>> > + // the cost is the same for both methods with powers of two
>> > + // but method 1 creates more optimizable code
>> > + if (cost[0] <= cost[1])
>> > + return createMulMethod1(a, b, c);
>> > + else
>> > + return createMulMethod2(a, b, c);
>> > +}
>> > +
>> > void
>> > ConstantFolding::opnd(Instruction *i, ImmediateValue &imm0, int s)
>> > {
>> > @@ -1040,13 +1126,25 @@ ConstantFolding::opnd(Instruction *i, ImmediateValue &imm0, int s)
>> > i->setSrc(s, i->getSrc(t));
>> > i->src(s).mod = i->src(t).mod;
>> > } else
>> > - if (!isFloatType(i->sType) && !imm0.isNegative() && imm0.isPow2()) {
>> > - i->op = OP_SHL;
>> > - imm0.applyLog2();
>> > - i->setSrc(0, i->getSrc(t));
>> > - i->src(0).mod = i->src(t).mod;
>> > - i->setSrc(1, new_ImmediateValue(prog, imm0.reg.data.u32));
>> > - i->src(1).mod = 0;
>> > + if (!isFloatType(i->dType)) {
>> > + bool optimized = false;
>> > + if (target->isOpSupported(OP_SHLADD, TYPE_U32)) {
>> > + bld.setPosition(i, false);
>> > + Value *val = createMul(i->getSrc(t), imm0.reg.data.u32, NULL);
>> > + if (val) {
>> > + i->def(0).replace(val, false);
>> > + delete_Instruction(prog, i);
>> > + optimized = true;
>> > + }
>> > + }
>> > + if (!optimized && !imm0.isNegative() && imm0.isPow2()) {
>> > + i->op = OP_SHL;
>> > + imm0.applyLog2();
>> > + i->setSrc(0, i->getSrc(t));
>> > + i->src(0).mod = i->src(t).mod;
>> > + i->setSrc(1, new_ImmediateValue(prog, imm0.reg.data.u32));
>> > + i->src(1).mod = 0;
>> > + }
>> > } else
>> > if (i->postFactor && i->sType == TYPE_F32) {
>> > /* Can't emit a postfactor with an immediate, have to fold it in */
>> > @@ -1079,13 +1177,13 @@ ConstantFolding::opnd(Instruction *i, ImmediateValue &imm0, int s)
>> > i->setSrc(2, NULL);
>> > i->op = OP_ADD;
>> > } else
>> > - if (s == 1 && !imm0.isNegative() && imm0.isPow2() &&
>> > - !isFloatType(i->dType) &&
>> > - target->isOpSupported(OP_SHLADD, i->dType) &&
>> > - !i->subOp) {
>> > - i->op = OP_SHLADD;
>> > - imm0.applyLog2();
>> > - i->setSrc(1, new_ImmediateValue(prog, imm0.reg.data.u32));
>> > + if (!isFloatType(i->dType) && target->isOpSupported(OP_SHLADD, TYPE_U32) && !i->subOp) {
>> > + bld.setPosition(i, false);
>> > + Value *val = createMul(i->getSrc(t), imm0.reg.data.u32, i->getSrc(2));
>> > + if (val) {
>> > + i->def(0).replace(val, false);
>> > + delete_Instruction(prog, i);
>> > + }
>> > }
>> > break;
>> > case OP_SUB:
>> > --
>> > 2.14.4
>> >
>> > _______________________________________________
>> > mesa-dev mailing list
>> > mesa-dev at lists.freedesktop.org
>> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev