[Mesa-dev] [PATCH 1/6] glsl: Optimize pow(x, 2) into x * x.

Tue Mar 11 12:16:40 PDT 2014

On Tue, Mar 11, 2014 at 10:35 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 11.03.2014 17:29, Ian Romanick wrote:
>> and there was much rejoicing.  The timings that we
>> use in the compiler backend are 22 cycles for POW, and 14 cycles for MUL
>> on Haswell.  The numbers are similar (but slightly longer) on
>> Sandybridge and Ivybridge.
> I think that works if you only care about latency. You appear to have
> a "base latency" of 14 cycles for anything and 22 for POW, but it
> looks to me like POW is significantly more expensive than that
> suggests. (That is, if you tried to issue nothing but POWs, or
> probably other functions from the extended math group, you'd find you
> could only get 1/4 or so of the throughput you get with MULs, since
> you probably cannot issue that function every two cycles, but you can
> do that with MULs. Just a guess though, assuming that during these
> additional latency cycles the hw cannot do another POW, and even if
> true, maybe latency is still more relevant in practice. But as I
> said, that's just a wild guess. I blame the docs for that :-).)

Nope, you're right. Haswell can issue 8 multiplies per EU per cycle,
but only one pow.
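
To make the latency-versus-throughput distinction concrete, here is a rough sketch using only the Haswell figures quoted in this thread (14 and 22 cycles of latency, 8 multiplies versus 1 pow issued per EU per cycle). The pipelined cost model and the names `issue_cycles` and `total_cycles` are assumptions made for illustration; they are not taken from the Mesa backend or the hardware docs.

```python
# Figures quoted in this thread for Haswell. The cost model below is an
# illustrative assumption, not the scheduler's actual model.
MUL_LATENCY = 14     # cycles, from the backend timing table
POW_LATENCY = 22     # cycles, from the backend timing table
MUL_PER_CYCLE = 8    # multiplies issued per EU per cycle
POW_PER_CYCLE = 1    # pows issued per EU per cycle


def issue_cycles(n_ops, ops_per_cycle):
    """Cycles needed just to issue n_ops independent ops (ceiling division)."""
    return -(-n_ops // ops_per_cycle)


def total_cycles(n_ops, ops_per_cycle, latency):
    """Hypothetical fully pipelined model: the last op issues in cycle
    issue_cycles - 1 and its result is ready `latency` cycles later."""
    return issue_cycles(n_ops, ops_per_cycle) - 1 + latency
```

Under these assumptions a single operation costs 14 cycles as a MUL and 22 as a POW, which is the latency-only view. For 64 independent operations the model gives 21 cycles for MULs and 85 for POWs, so the gap is dominated by issue rate, not latency.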