<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW - Add FP64 support to the i965 shader backends" href="https://bugs.freedesktop.org/show_bug.cgi?id=92760#c60">Comment # 60</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - Add FP64 support to the i965 shader backends" href="https://bugs.freedesktop.org/show_bug.cgi?id=92760">bug 92760</a> from <a class="email" href="mailto:jason@jlekstrand.net" title="Jason Ekstrand <jason@jlekstrand.net>"> Jason Ekstrand</a> <pre>(In reply to Iago Toral from <a href="show_bug.cgi?id=92760#c58">comment #58</a>) > (In reply to Jason Ekstrand from <a href="show_bug.cgi?id=92760#c30">comment #30</a>) > > Created <a href="attachment.cgi?id=120957" name="attach_120957" title="NIR indirect lowering pass">attachment 120957</a> <a href="attachment.cgi?id=120957&action=edit" title="NIR indirect lowering pass">[details]</a> > > NIR indirect lowering pass > > > > (In reply to Iago Toral from <a href="show_bug.cgi?id=92760#c29">comment #29</a>) > > > Hey Jason/Connor, > > > > > > the lowering of trunc for doubles has some code that looks like this > > > (pseudo-code): > > > > > > if (exponent < 0) { > > > mask = 0x0 > > > } else if (exponent > 52) { > > > mask = 0x7fffffffffffffff; > > > } else { > > > /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */ > > > mask = (1LL << frac_bits) - 1; > > > } > > > > > > The current implementation I have works fine using bcsel. It looks something > > > like this (again, pseudo-code): > > > > > > mask = bcsel(exponent < 0, > > > 0x7fffffffffffffff, > > > bcsel(exponent > 52, > > > 0x0000000000000000, > > > (1LL << frac_bits) -1)) > > > > > > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer > > > operation that we have to implement in terms of hi/lo 32-bit integer > > > operations (at least until we support 64-bit integers), so it is really a > > > bunch of instructions. 
Because I use bcsel, it means that we generate code
> > > for that even if exponent is not in [1..51], which is not ideal.
> >
> > Right. I would encourage you not to use if's too much because branching may
> > be more expensive than bcsel depending on what paths different invocations
> > take. However, if one side of the if is overwhelmingly more likely than the
> > other, then control-flow is probably a good idea.
>
> I have been revisiting this. Because if statements in NIR are strictly
> scalar, this lowering needs to be scalarized as well. I wonder if the
> scalarized code resulting from this defeats the purpose of using the if
> statement for the vec4 backend, since we lose the ability to use vector
> instructions.
>
> Some quick experiments with a simple trunc() test show these results
> (#instructions):
>
> backend        bcsel    if (unscalarized)    if (scalarized)
> ----------------------------------------------------------------
> vec4           65       69                   102
> fs (simd8)     67       85                   85
> fs (simd16)    95       119                  119
>
> bcsel implementations have fewer overall instructions as expected, although
> as discussed before, if implementations may be better in some cases since
> they might end up executing fewer instructions. However, it is
> clear that the required scalarization for the if statement in the vec4
> backend makes things much worse, to a point that I am not sure any more
> that this is a win in this scenario.
>
> So we have 2 options again:
>
> 1) Go back to the bcsel implementation for both backends.
> 2) Pass an is_scalar flag to the lowering pass, choose the bcsel
> implementation for non scalar backends and the scalarized if implementation
> for scalar.
>
> 2) _might_ be better overall from a performance standpoint but I wonder if
> it is worth having two different implementations of this. This decision
> would also affect the implementation of roundEven().
>
> What do you think?

I'm sorry that it has taken so long for me to get back to you. 
We've had a lot going on at the office lately.

I think (1) is probably the best option for a couple of reasons:

a) Simplicity, as you mentioned, and the ability to vectorize in vec4.

b) Even in simd8 mode, you are executing 8 threads at a time and you will
only actually skip instructions if all 8 threads take the same side of the
branch. While this may happen fairly frequently depending on the context,
there's still a decent chance that all of the instructions will get executed
anyway.

c) Even if (b) isn't a problem, using bcsel gives us much more freedom when
scheduling instructions. Any sort of control flow acts as an instruction
scheduling barrier and prevents us from moving things around. This is a
problem for both register pressure (causing spilling) and instruction
latency. If we use bcsel, there's a decent chance that we can move
instructions around and hide latency enough to make up for the extra
instructions being executed.</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the QA Contact for the bug.</li> </ul> </body> </html>