[Mesa-dev] [PATCH 1/6] glsl: Optimize pow(x, 2) into x * x.

Roland Scheidegger sroland at vmware.com
Tue Mar 11 10:35:41 PDT 2014


On 11.03.2014 17:29, Ian Romanick wrote:
> On 03/10/2014 07:21 PM, Roland Scheidegger wrote:
>> On 11.03.2014 01:23, Ian Romanick wrote:
>>> I had a pretty similar patch on the top of my pow-optimization branch.
>>> I also expand x**3 and x**4.  I had hoped that would enable some cases
>>> to expand then merge to MADs.  It should also be faster on older GENs
>>> where POW perf sucks.  I didn't send it out because I wanted to add a
>>> similar optimization in the back end that would turn x*x*x*x back into
>>> x**4 on GPUs where the POW would be faster.
>> I have no idea what performance POW has on newer Intel GPU hw (since,
>> in contrast to older pre-SNB hw with its separate mathbox, the manual
>> doesn't list throughput for extended math functions, at least I never
>> found it), but I find it highly unlikely that a POW has a cost lower
>> than 2 muls anywhere.
> 
> The architecture has changed quite a bit, so "math box" is kind of a
> thing of the past...
That's why I said pre-SNB hw :-).

> and there was much rejoicing.  The timings that we
> use in the compiler backend are 22 cycles for POW, and 14 cycles for MUL
> on Haswell.  The numbers are similar (but slightly longer) on
> Sandybridge and Ivybridge.
I think that works if you just care about latency, since it appears you
have a "base latency" of 14 cycles for anything but 22 for POW. In terms
of throughput, however, it looks to me like POW is significantly more
expensive. (That is, if you tried to issue nothing but POWs, or probably
other functions from the extended math group, you'd find you only get
1/4 or so of the throughput you get with MULs, since you probably cannot
issue such a function every two cycles, but you can do that with MULs.
Just a guess though, assuming that during these additional latency
cycles the hw cannot start another POW, and even if that's true, maybe
latency really is still more relevant in practice. But as said, that's
just a wild guess; I blame the docs for that :-).)
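
To put made-up numbers on that guess (the issue rates below are pure
invention chosen to match the 1/4 figure; only the 22/14 cycle latencies
come from Ian's mail), the kind of shader where throughput rather than
latency would decide it looks something like this:

#version 120
// Hypothetical, purely illustrative shader that is throughput-bound
// rather than latency-bound: 16 independent pow() calls and no long
// dependent chain.  If a POW can really only be issued every ~8 cycles
// while a MUL can go every ~2 (made-up rates matching the 1/4 guess,
// not documented anywhere I know of), the x*x expansion should win
// clearly here, even though for one isolated pow(x, 2.0) the latency
// difference is only 22 vs. 14 cycles.
uniform vec4 u[16];

void main()
{
   vec4 acc = vec4(0.0);
   for (int i = 0; i < 16; i++)
      acc += pow(u[i], vec4(2.0));
   gl_FragColor = acc;
}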


>> Roland
>>
>>
>>> I also didn't have anything in shader-db that benefitted from x**2 or
>>> x**3.  It seems like there were a couple that would be modified by an
>>> x**5 flattening, but I think that would universally be slower....
>>>
>>> On 03/10/2014 03:54 PM, Matt Turner wrote:
>>>> Cuts two instructions out of SynMark's Gl32VSInstancing benchmark.
>>>> ---
>>>>  src/glsl/opt_algebraic.cpp | 8 ++++++++
>>>>  1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/src/glsl/opt_algebraic.cpp b/src/glsl/opt_algebraic.cpp
>>>> index 5c49a78..8494bd9 100644
>>>> --- a/src/glsl/opt_algebraic.cpp
>>>> +++ b/src/glsl/opt_algebraic.cpp
>>>> @@ -528,6 +528,14 @@ ir_algebraic_visitor::handle_expression(ir_expression *ir)
>>>>        if (is_vec_two(op_const[0]))
>>>>           return expr(ir_unop_exp2, ir->operands[1]);
>>>>  
>>>> +      if (is_vec_two(op_const[1])) {
>>>> +         ir_variable *x = new(ir) ir_variable(ir->operands[1]->type, "x",
>>>> +                                              ir_var_temporary);
>>>> +         base_ir->insert_before(x);
>>>> +         base_ir->insert_before(assign(x, ir->operands[0]));
>>>> +         return mul(x, x);
>>>> +      }
>>>> +
>>>>        break;
>>>>  
>>>>     case ir_unop_rcp:
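
For readers following along, here is a rough GLSL-level sketch (not
actual output of the pass) of what the rewrite above does, plus the
x**4 and MAD cases Ian mentions.  The temporary corresponds to the
ir_var_temporary "x" the patch inserts, so the base expression is only
evaluated once:

#version 120
uniform vec3 n, l;
uniform float c;

void main()
{
   // before:  float a = pow(dot(n, l), 2.0);
   // after the patch, conceptually:
   float x = dot(n, l);   // the inserted ir_var_temporary "x"
   float a = x * x;       // one MUL instead of a POW

   // Ian's x**4 expansion would follow the same pattern:
   //   float x2 = x * x;  float a4 = x2 * x2;    (two MULs)
   // and a following add can then merge into a MAD:
   //   pow(x, 2.0) + c  ->  x * x + c  ->  a single MAD
   gl_FragColor = vec4(a + c);
}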

