[Bug 92760] Add FP64 support to the i965 shader backends

Tue Mar 1 17:12:59 UTC 2016

https://bugs.freedesktop.org/show_bug.cgi?id=92760

--- Comment #60 from Jason Ekstrand <jason at jlekstrand.net> ---
(In reply to Iago Toral from comment #58)
> (In reply to Jason Ekstrand from comment #30)
> > Created attachment 120957 [details]
> > NIR indirect lowering pass
> > 
> > (In reply to Iago Toral from comment #29)
> > > Hey Jason/Connor,
> > > 
> > > the lowering of trunc for doubles has some code that looks like this
> > > (pseudo-code):
> > > 
> > > if (exponent < 0) {
> > >    mask = 0x7fffffffffffffff;
> > > } else if (exponent > 52) {
> > >    mask = 0x0;
> > > } else {
> > >    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> > >    mask = (1LL << frac_bits) - 1;
> > > }
> > > 
> > > The current implementation I have works fine using bcsel. It looks something
> > > like this (again, pseudo-code):
> > > 
> > > mask = bcsel(exponent < 0,
> > >              0x7fffffffffffffff,
> > >              bcsel(exponent > 52,
> > >                    0x0000000000000000,
> > >                    (1LL << frac_bits) -1))
> > > 
> > > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > > operation that we have to implement in terms of hi/lo 32-bit integer
> > > operations (at least until we support 64-bit integers), so it is really a
> > > bunch of instructions. Because I use bcsel, it means that we generate code
> > > for that even if exponent is not in [0..52], which is not ideal.
> > 
> > Right.  I would encourage you not to use if's too much because branching may
> > be more expensive than bcsel depending on what paths different invocations
> > take.  However, if one side of the if is overwhelmingly more likely than the
> > other, then control-flow is probably a good idea.
> 
> I have been revisiting this. Because if statements in NIR are strictly
> scalar, this lowering needs to be scalarized as well. I wonder if the
> scalarized code resulting from this defeats the purpose of using the if
> statement for the vec4 backend, since we lose the ability to use vector
> instructions.
> 
> Some quick experiments with a simple trunc() test show these results
> (#instructions):
> 
> backend        bcsel      if (unscalarized)     if (scalarized)
> ----------------------------------------------------------------
> vec4            65              69                   102
> fs (simd8)      67              85                    85
> fs (simd16)     95             119                   119
> 
> bcsel implementations have fewer instructions overall, as expected, although
> as discussed before, if implementations may be better in some cases since
> they might end up executing fewer instructions. However, it is
> clear that the required scalarization for the if statement in the vec4
> backend makes things much worse, to the point that I am no longer sure
> that this is a win in this scenario.
> 
> So we have 2 options again:
> 
> 1) Go back to the bcsel implementation for both backends.
> 2) Pass an is_scalar flag to the lowering pass, choose the bcsel
> implementation for non-scalar backends and the scalarized if implementation
> for scalar backends.
> 
> 2) _might_ be better overall from a performance standpoint but I wonder if
> it is worth having two different implementations of this. This decision
> would also affect the implementation of roundEven().
> 
> What do you think?

I'm sorry that it has taken so long for me to get back to you.  We've had a lot
going on at the office lately.

I think (1) is probably the best option for a couple of reasons:

 a) Simplicity, as you mentioned, and the ability to vectorize in vec4.

 b) Even in simd8 mode, you are executing 8 threads at a time and you will only
actually skip instructions if all 8 threads take the same side of the branch. 
While this may happen fairly frequently depending on the context, there's still
a decent chance that all of the instructions will get executed anyway.

 c) Even if (b) isn't a problem, using bcsel gives us much more freedom when
scheduling instructions.  Any sort of control flow acts as an instruction
scheduling barrier and prevents us from moving things around.  This is a
problem for both register pressure (causing spilling) and instruction latency. 
If we use bcsel, there's a decent chance that we can move instructions around
and hide latency enough to make up for the extra instructions being executed.
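
To make the bcsel form concrete, here is a minimal sketch in plain C of a
branchless trunc() for doubles built only from 32-bit hi/lo operations.  This
is not the Mesa implementation: trunc_double_bcsel and the bcsel helper are
hypothetical names used for illustration, and the sketch assumes IEEE 754
binary64 and that "mask" marks the bits to clear, as in the bcsel pseudo-code
quoted above.  Every value is computed unconditionally and then selected,
which is the property that leaves the scheduler free to reorder things.

```c
#include <stdint.h>
#include <string.h>

/* Models a select: both operands are already computed, one is picked. */
static uint32_t bcsel(int cond, uint32_t a, uint32_t b)
{
    return cond ? a : b;
}

/* Hypothetical helper: truncate toward zero using only 32-bit integer ops. */
double trunc_double_bcsel(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t hi = (uint32_t)(bits >> 32);  /* sign, exponent, top 20 mantissa bits */
    uint32_t lo = (uint32_t)bits;          /* low 32 mantissa bits */

    int32_t exponent = (int32_t)((hi >> 20) & 0x7ff) - 1023;
    uint32_t frac_bits = (uint32_t)(52 - exponent);

    /* (1 << frac_bits) - 1 split into halves.  Shift counts are masked to
     * 0..31 so they stay well defined even when the result is discarded. */
    uint32_t lo_small = (UINT32_C(1) << (frac_bits & 31)) - 1;
    uint32_t hi_large = (UINT32_C(1) << ((frac_bits - 32) & 31)) - 1;
    int wide = frac_bits >= 32;

    uint32_t mask_lo = bcsel(wide, 0xffffffffu, lo_small);
    uint32_t mask_hi = bcsel(wide, hi_large, 0);

    /* exponent < 0: clear everything but the sign (result is +/-0).
     * exponent > 52: no fractional bits (also covers inf/NaN), clear nothing. */
    mask_lo = bcsel(exponent < 0, 0xffffffffu,
                    bcsel(exponent > 52, 0, mask_lo));
    mask_hi = bcsel(exponent < 0, 0x7fffffffu,
                    bcsel(exponent > 52, 0, mask_hi));

    hi &= ~mask_hi;
    lo &= ~mask_lo;

    bits = ((uint64_t)hi << 32) | lo;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, trunc_double_bcsel(2.75) takes the "wide" path (frac_bits is 51)
and only clears bits in the hi word, while an input such as 123456789.987
(frac_bits is 26) only clears bits in the lo word.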
