[Mesa-dev] [PATCH 0/2] Nir: Allow CSE of SSBO loads

Thu Oct 22 08:53:52 PDT 2015

On Thu, Oct 22, 2015 at 10:37 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> On Thu, Oct 22, 2015 at 7:21 AM, Iago Toral Quiroga <itoral at igalia.com> wrote:
>>> I implemented this first as a separate optimization pass in GLSL IR [1], but
>>> Curro pointed out that this being pretty much a restricted form of a CSE pass
>>> it would probably make more sense to do it inside CSE (and we no longer have
>>> a CSE pass in GLSL IR).
>>>
>>> Unlike other things we CSE in NIR, in the case of SSBO loads we need to make
>>> sure that we invalidate previous entries in the set in the presence of
>>> conflicting instructions (i.e. SSBO writes to the same block and offset) or
>>> in the presence of memory barriers.
>>>
>>> If this is accepted I intend to extend this to also cover image reads, which
>>> follow similar behavior.
>>>
>>> No regressions observed in piglit or dEQP's SSBO functional tests.
>>>
>>> [1] http://lists.freedesktop.org/archives/mesa-dev/2015-October/097718.html
>>>
>>> Iago Toral Quiroga (2):
>>>   nir/cse: invalidate SSBO loads in presence of ssbo writes or memory
>>>     barriers
>>>   nir/instr_set: allow rewrite of SSBO loads
>>>
>>>  src/glsl/nir/nir_instr_set.c |  24 ++++++--
>>>  src/glsl/nir/nir_opt_cse.c   | 142 +++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 162 insertions(+), 4 deletions(-)
>>>
>>> --
>>> 1.9.1
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>> NAK, this isn't going to work. NIR CSE is designed for operations
>> which can be moved around freely as long they're still dominated by
>> the SSA values they use. It makes heavy advantage of this to avoid
>> looking at the entire CFG and instead only at the current block and
>> its parents in the dominance tree. For example, imagine you have
>> something like:
>>
>> A = load_ssbo 0
>> if  (cond) {
>>    store_ssbo 0
>> }
>> B = load_ssbo 0
>>
>> Then A and B can't be combined, but CSE will combine them anyways when
>> it reaches B because it keeps a hash table of values dominating B and
>> finds A as a match. It doesn't look at the if conditional at all
>> because it doesn't dominate the load to B. This is great when you want
>> to CSE pure things that don't depend on other side effects -- after
>> all, this is the sort of efficiency that SSA is supposed to give us --
>> but it means that as-is, it can't be used for e.g. SSBO's and images
>> without completely changing how the pass works and making it less
>> efficient.
>>
>> Now, that being said, I still think that we should definitely be doing
>> this sort of thing in NIR now that we've finally added support for
>> SSBO's and images. We've been trying to avoid adding new optimizations
>> to GLSL, since we've been trying to move away from it. In addition,
>> with SPIR-V on the way, anything added to GLSL IR now is something
>> that we won't be able to use with SPIR-V shaders. Only doing it in FS
>> doesn't sound so great either; we should be doing as much as possible
>> at the midlevel, and combining SSBO loads is something that isn't
>> FS-specific at all.
>>
>> There are two ways I can see support for this being added to NIR:
>>
>> 1. Add an extra fake source/destination to intrinsics with side
>> effects, and add a pass to do essentially a conversion to SSA that
>> wires up these "token" sources/destinations, or perhaps extend the
>> existing to-SSA pass.
>>
> This is in fact the approach taken by other SSA-form IRs like LLVM's --
> Side-effectful instructions take a "chain" input as argument and spit
> out a new chain token which can then be passed as argument to subsequent
> non-pure instructions for which a control-flow dependency exists.

IIUC, LLVM actually doesn't take this approach; it has a special
LoadCombine pass [1]  and various other load/store-specific
optimizations as well, and doesn't use token passing. But yeah, it's
not exactly a new idea -- although I'm still not convinced that the
"obvious" way of doing it can handle arbitrary code motion. For
example, given something like:

store foo, b;
if (cond) {
    store foo, a;
}
... = load foo;

then the obvious way to handle the tokens would be:

token1 = store foo, b;
if (cond) {
    token2 = store foo, a, token1;
}
token3 = phi(token2, token1);
... = load foo, token3;

This orders the load and the store, but now theoretically there's
nothing preventing, say, GCM from moving the store up:

token1 = store foo, b;
token2 = store foo, a, token1;
if (cond) {
}
token3 = phi(token2, token1);
... = load foo, token3;

which is obviously illegal. Of course, in reality, GCM wouldn't do
that, but the point still stands that the control dependencies aren't
being fully expressed through the SSA tokens. The way I came up to
solve that was to borrow an idea from SSI [2] and add a "sigma node"
that's the inverse of a phi node, i.e. it exists at the end of the
basic block with multiple successors and has a single input an output
for each successor. The sigma nodes are tied their basic block
similarly to phi nodes. Then the example becomes:

token1 = store foo, b;
token2, token3 = sigma(token1);
if (cond) {
    token4 = store foo, a, token2;
}
token5 = phi(token4, token2);
... = load foo, token6;

and the sigma expresses the control dependency of the store.

Another thing is that we need another operation on tokens to express
the idea that "A and B must happen before C." For example, we need
this to express the fact that loads can happen in parallel, or else we
won't be able to CSE loads.

There are still a couple questions left, though: is this now correct?
What kind of transformations can we safely do to expose more freedom
and e.g. predicate stores?

Is there someone else out there that's done something similar? Most
schemes that I've seen, like HSSA [3], are much more conservative, not
solving the predicated store problem (so presumably, you have to be
careful when doing code motion). There's the program dependence graph
[4], but they seem to indicate control dependence in a different way.
This makes sense, since having to construct full-blown SSI for every
pointer dereference would probably murder your performance and memory
usage for most programs designed to run on the CPU. SSBO's are a much
rarer thing, and I would presume they aren't used as often in shaders,
so it's probably more practical for us to do it, but it doesn't
exactly look like this is a well-trodden path.

[1] http://llvm.org/docs/doxygen/html/LoadCombine_8cpp_source.html
[2] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-801.pdf
[3] "Effective Representation of Aliases and Indirect Memory
Operations in SSA Form" by Chow et. al.
[4] https://www.cs.utexas.edu/~pingali/CS395T/2009fa/papers/ferrante87.pdf

>
>> 2. Add a special "load-combining" pass that does some dataflow
>> analysis or similar (or, for now, only looks at things within a single
>> block).
>>
>> The advantage of #1 is that we get to use existing NIR passes, like
>> CSE, DCE, and GCM "for free" on SSBO loads and stores, without having
>> to do the equivalent thing using dataflow analysis. Also, doing store
>> forwarding (i.e. replacing the result of an SSBO load with the value
>> corresponding to a store, if we can figure out which store affects it)
>> is going to much easier. However, #1 is going to be much more of a
>> research project. I've thought about how we could do it, but I'm still
>> not sure how it could be done feasibly and still be correct.
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev