[Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

Mon May 18 15:31:56 PDT 2015

On Mon, May 18, 2015 at 3:28 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> On Monday, May 18, 2015 11:26:05 AM Matt Turner wrote:
>> On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke <kenneth at whitecape.org> wrote:
>> > According to Glenn, shifts on R600 have 5x the throughput of multiplies.
>> >
>> > Intel GPUs have strange integer multiplication restrictions - on most
>> > hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
>> > means the arguments aren't commutative, which can limit our constant
>> > propagation options.  SHL has no such restrictions.
>> >
>> > Shifting is probably reasonable on most people's hardware, so let's just
>> > do that.
>> >
>> > i965 shader-db results (using NIR for VS):
>> > total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
>> > instructions in affected programs:     1360411 -> 1316806 (-3.21%)
>> > helped:                                5772
>> > HURT:                                  0
>>
>> Just to close the loop, I ran shader-db with this patch on top of my
>> integer multiplication series, and it doesn't change any instruction
>> counts on i965. (I also tried with all other power-of-two
>> multiplications for shift values < 31.)
>>
>> We may want to do it for other reasons though.
>
> If we're going to do it because shifts are faster/nicer than multiplies,
> then we should probably just do it for powers-of-two in general.
> Unfortunately, opt_algebraic doesn't really lend itself to that without
> adding some sort of "power of two" infrastructure.
>
> I guess we could optimize things like:
> a * 2^n  =>  a << n
> a % 2^n  =>  a & (2^n - 1)
> a / 2^n  =>  a >> n (possibly only for unsigned? (*))
> ...others?
>
> The first is a clear win on r600, and the latter are clear wins on i965,
> though they may be rather rare...
>
> We could add a custom NIR pass.  Or, we could just have backends
> check for an immediate second operand and do this sort of stuff.  Or
> optimize it themselves.  *shrug*
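
For reference, the identities in that list can be sanity-checked outside
the compiler.  This is only a rough sketch in plain Python, modelling
32-bit unsigned values with a mask; MASK32, check_unsigned and trunc_div
are names made up for this illustration, not anything in Mesa.  It also
shows why the division case is probably only safe for unsigned values:
a C-style signed divide rounds toward zero, while an arithmetic shift
rounds toward negative infinity.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit unsigned wraparound (illustrative helper)

def check_unsigned(a, n):
    """Return True if all three power-of-two identities hold for unsigned a."""
    p = 1 << n
    mul_ok = ((a * p) & MASK32) == ((a << n) & MASK32)  # a * 2^n == a << n
    mod_ok = (a % p) == (a & (p - 1))                   # a % 2^n == a & (2^n - 1)
    div_ok = (a // p) == (a >> n)                       # a / 2^n == a >> n
    return mul_ok and mod_ok and div_ok

def trunc_div(a, b):
    """Signed division rounding toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

samples = [0, 1, 5, 7, 1000, 0x7FFFFFFF, 0xFFFFFFFF]
all_unsigned_ok = all(check_unsigned(a, n) for a in samples for n in range(32))

# Signed case: Python's >> on a negative int is an arithmetic (floor) shift.
signed_div = trunc_div(-7, 2)   # rounds toward zero
signed_shift = -7 >> 1          # rounds toward negative infinity
```

With these samples the unsigned checks all pass, while the signed divide
and the signed shift disagree for -7 / 2.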

Or we could just do

for i in range(32):
    optimizations.append((('imul', a, 1 << i), ('ishl', a, i)))
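
To make that concrete, here is a standalone sketch of what the loop
generates.  It assumes `a` is a plain placeholder string and that
`optimizations` is a list of (search, replace) tuples, which is how I
would expect it to look in nir_opt_algebraic.py; the `evaluate` helper
and MASK32 are made up purely to check the rules on a toy 32-bit model
and are not part of Mesa.

```python
a = 'a'             # placeholder variable name (assumed, mirrors the snippet above)
optimizations = []  # list of (search, replace) expression pairs

# One rule per power of two: a * 2^i  =>  a << i
for i in range(32):
    optimizations.append((('imul', a, 1 << i), ('ishl', a, i)))

MASK32 = 0xFFFFFFFF  # illustrative 32-bit wraparound mask

def evaluate(expr, value):
    """Evaluate an ('op', a, constant) tuple on a 32-bit unsigned value (toy model)."""
    op, _, const = expr
    if op == 'imul':
        return (value * const) & MASK32
    if op == 'ishl':
        return (value << const) & MASK32
    raise ValueError('unhandled opcode: ' + op)

# Every generated search pattern should compute the same result as its replacement.
rules_agree = all(
    evaluate(search, v) == evaluate(replace, v)
    for search, replace in optimizations
    for v in (0, 1, 3, 0x12345678, 0xFFFFFFFF)
)
```

The i == 2 entry is the 4*x => x << 2 rule from the original patch.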

--Jason