<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - Add FP64 support to the i965 shader backends"
href="https://bugs.freedesktop.org/show_bug.cgi?id=92760#c60">Comment # 60</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - Add FP64 support to the i965 shader backends"
href="https://bugs.freedesktop.org/show_bug.cgi?id=92760">bug 92760</a>
from <span class="vcard"><a class="email" href="mailto:jason@jlekstrand.net" title="Jason Ekstrand <jason@jlekstrand.net>"> <span class="fn">Jason Ekstrand</span></a>
</span></b>
<pre>(In reply to Iago Toral from <a href="show_bug.cgi?id=92760#c58">comment #58</a>)
<span class="quote">> (In reply to Jason Ekstrand from <a href="show_bug.cgi?id=92760#c30">comment #30</a>)
> > Created <span class=""><a href="attachment.cgi?id=120957" name="attach_120957" title="NIR indirect lowering pass">attachment 120957</a> <a href="attachment.cgi?id=120957&action=edit" title="NIR indirect lowering pass">[details]</a></span>
> > NIR indirect lowering pass
> >
> > (In reply to Iago Toral from <a href="show_bug.cgi?id=92760#c29">comment #29</a>)
> > > Hey Jason/Connor,
> > >
> > > the lowering of trunc for doubles has some code that looks like this
> > > (pseudo-code):
> > >
> > > if (exponent < 0) {
> > > mask = 0x0
> > > } else if (exponent > 52) {
> > > mask = 0x7fffffffffffffff;
> > > } else {
> > > /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> > > mask = (1LL << frac_bits) - 1;
> > > }
> > >
> > > The current implementation I have works fine using bcsel. It looks something
> > > like this (again, pseudo-code):
> > >
> > > mask = bcsel(exponent < 0,
> > > 0x7fffffffffffffff,
> > > bcsel(exponent > 52,
> > > 0x0000000000000000,
> > > (1LL << frac_bits) -1))
> > >
> > > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > > operation that we have to implement in terms of hi/lo 32-bit integer
> > > operations (at least until we support 64-bit integers), so it is really a
> > > bunch of instructions. Because I use bcsel, it means that we generate code
> > > for that even if exponent is not in [1..51], which is not ideal.
> >
> > Right. I would encourage you not to use if's too much because branching may
> > be more expensive than bcsel depending on what paths different invocations
> > take. However, if one side of the if is overwhelmingly more likely than the
> > other, then control-flow is probably a good idea.
>
> I have been revisiting this. Because if statements in NIR are strictly
> scalar, this lowering needs to be scalarized as well. I wonder if the
> scalarized code resulting of this defeats the purpose of using the if
> statement for the vec4 backend, since we lose the ability to use vector
> instructions.
>
> Some quick experiments with a simple trunc() test show these results
> (#instructions):
>
> backed bcsel if (unscalarized) if (scalarized)
> ----------------------------------------------------------------
> vec4 65 69 102
> fs (simd8) 67 85 85
> fs (simd16) 95 119 119
>
> bcsel implementations have less overall instructions as expected, although
> as discussed before, if implementations may be better in some cases since
> they might end up executing less instructions in some cases. However, it is
> clear that the required scalarization for the if statement in the vec4
> backend makes things much worse, to a point that I am not sure any more
> that this is a win in this scenario.
>
> So we have 2 options again:
>
> 1) Go back to the bcsel implementation for both backends.
> 2) Pass an is_scalar flag to the lowering pass, choose the bcsel
> implementation for non scalar backends and the scalarized if implementation
> for scalar.
>
> 2) _might_ be better overall from a performance standpoint but I wonder if
> it is worth having two different implementations of this. This decision
> would also affect the implementation of roundEven().
>
> What do you think?</span >
I'm sorry that it has taken so long for me to get back to you. We've had a lot
going on at the office lately.
I think (1) is probably the best option for a couple of reasons:
a) Simplicity, as you mentioned, and the ability to vectorize in vec4.
b) Even in simd8 mode, you are executing 8 threads at a time and you will only
actually skip instructions if all 8 threads take the same side of the branch.
While this may happen fairly frequently depending on the context, there's still
a decent chance that all of the instructions will get executed anyway.
c) Even if (b) isn't a problem, using bcsel gives us much more freedom when
scheduling instructions. Any sort of control flow acts as an instruction
scheduling barrier and prevents us from moving things around. This is a
problem for both register pressure (causing spilling) and instruction latency.
If we use bcsel, there's a decent chance that we can move instructions around
and hide latency enough to make up for the extra instructions being executed.</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are the QA Contact for the bug.</li>
</ul>
</body>
</html>