[Mesa-dev] [PATCH 0/2] Nir: Allow CSE of SSBO loads

Fri Oct 23 04:13:27 PDT 2015

Connor Abbott <cwabbott0 at gmail.com> writes:

> On Thu, Oct 22, 2015 at 10:37 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>
>>> On Thu, Oct 22, 2015 at 7:21 AM, Iago Toral Quiroga <itoral at igalia.com> wrote:
>>>> I implemented this first as a separate optimization pass in GLSL IR [1], but
>>>> Curro pointed out that this being pretty much a restricted form of a CSE pass
>>>> it would probably make more sense to do it inside CSE (and we no longer have
>>>> a CSE pass in GLSL IR).
>>>>
>>>> Unlike other things we CSE in NIR, in the case of SSBO loads we need to make
>>>> sure that we invalidate previous entries in the set in the presence of
>>>> conflicting instructions (i.e. SSBO writes to the same block and offset) or
>>>> in the presence of memory barriers.
>>>>
>>>> If this is accepted I intend to extend this to also cover image reads, which
>>>> follow similar behavior.
>>>>
>>>> No regressions observed in piglit or dEQP's SSBO functional tests.
>>>>
>>>> [1] http://lists.freedesktop.org/archives/mesa-dev/2015-October/097718.html
>>>>
>>>> Iago Toral Quiroga (2):
>>>>   nir/cse: invalidate SSBO loads in presence of ssbo writes or memory
>>>>     barriers
>>>>   nir/instr_set: allow rewrite of SSBO loads
>>>>
>>>>  src/glsl/nir/nir_instr_set.c |  24 ++++++--
>>>>  src/glsl/nir/nir_opt_cse.c   | 142 +++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 162 insertions(+), 4 deletions(-)
>>>>
>>>> --
>>>> 1.9.1
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>> NAK, this isn't going to work. NIR CSE is designed for operations
>>> which can be moved around freely as long they're still dominated by
>>> the SSA values they use. It makes heavy advantage of this to avoid
>>> looking at the entire CFG and instead only at the current block and
>>> its parents in the dominance tree. For example, imagine you have
>>> something like:
>>>
>>> A = load_ssbo 0
>>> if  (cond) {
>>>    store_ssbo 0
>>> }
>>> B = load_ssbo 0
>>>
>>> Then A and B can't be combined, but CSE will combine them anyways when
>>> it reaches B because it keeps a hash table of values dominating B and
>>> finds A as a match. It doesn't look at the if conditional at all
>>> because it doesn't dominate the load to B. This is great when you want
>>> to CSE pure things that don't depend on other side effects -- after
>>> all, this is the sort of efficiency that SSA is supposed to give us --
>>> but it means that as-is, it can't be used for e.g. SSBO's and images
>>> without completely changing how the pass works and making it less
>>> efficient.
>>>
>>> Now, that being said, I still think that we should definitely be doing
>>> this sort of thing in NIR now that we've finally added support for
>>> SSBO's and images. We've been trying to avoid adding new optimizations
>>> to GLSL, since we've been trying to move away from it. In addition,
>>> with SPIR-V on the way, anything added to GLSL IR now is something
>>> that we won't be able to use with SPIR-V shaders. Only doing it in FS
>>> doesn't sound so great either; we should be doing as much as possible
>>> at the midlevel, and combining SSBO loads is something that isn't
>>> FS-specific at all.
>>>
>>> There are two ways I can see support for this being added to NIR:
>>>
>>> 1. Add an extra fake source/destination to intrinsics with side
>>> effects, and add a pass to do essentially a conversion to SSA that
>>> wires up these "token" sources/destinations, or perhaps extend the
>>> existing to-SSA pass.
>>>
>> This is in fact the approach taken by other SSA-form IRs like LLVM's --
>> Side-effectful instructions take a "chain" input as argument and spit
>> out a new chain token which can then be passed as argument to subsequent
>> non-pure instructions for which a control-flow dependency exists.
>
> IIUC, LLVM actually doesn't take this approach; it has a special
> LoadCombine pass [1]  and various other load/store-specific
> optimizations as well, and doesn't use token passing.

It uses token passing extensively in selection DAGs to represent
control-flow dependencies between things like loads and stores.
E.g. see 'lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp'.

> But yeah, it's not exactly a new idea -- although I'm still not
> convinced that the "obvious" way of doing it can handle arbitrary code
> motion. For example, given something like:
>
> store foo, b;
> if (cond) {
>     store foo, a;
> }
> ... = load foo;
>
> then the obvious way to handle the tokens would be:
>
> token1 = store foo, b;
> if (cond) {
>     token2 = store foo, a, token1;
> }
> token3 = phi(token2, token1);
> ... = load foo, token3;
>

Yeah, I don't think it would be a good idea to allow arbitrary motion of
stores and atomics under this approach.

> This orders the load and the store, but now theoretically there's
> nothing preventing, say, GCM from moving the store up:
>
> token1 = store foo, b;
> token2 = store foo, a, token1;
> if (cond) {
> }
> token3 = phi(token2, token1);
> ... = load foo, token3;
>
> which is obviously illegal. Of course, in reality, GCM wouldn't do
> that, but the point still stands that the control dependencies aren't
> being fully expressed through the SSA tokens. The way I came up to
> solve that was to borrow an idea from SSI [2] and add a "sigma node"
> that's the inverse of a phi node, i.e. it exists at the end of the
> basic block with multiple successors and has a single input an output
> for each successor. The sigma nodes are tied their basic block
> similarly to phi nodes. Then the example becomes:
>
> token1 = store foo, b;
> token2, token3 = sigma(token1);
> if (cond) {
>     token4 = store foo, a, token2;
> }
> token5 = phi(token4, token2);
> ... = load foo, token6;
>
> and the sigma expresses the control dependency of the store.
>

That would likely work, but I'm not clear what the benefit would be over
just preventing motion of stores across basic blocks (except where you
can prove that the origin and destination nodes in the CFG form a
single-entry single-exit region, task which doesn't seem to be
substantially simplified after you've woven the CFG with sigma and phi
nodes passing the tokens around).

> Another thing is that we need another operation on tokens to express
> the idea that "A and B must happen before C." For example, we need
> this to express the fact that loads can happen in parallel, or else we
> won't be able to CSE loads.
>
LLVM's selection DAGs use nodes (TokenFactor) that take multiple tokens
as input and give a single token as output to express this.

> There are still a couple questions left, though: is this now correct?
> What kind of transformations can we safely do to expose more freedom
> and e.g. predicate stores?
>
> Is there someone else out there that's done something similar? Most
> schemes that I've seen, like HSSA [3], are much more conservative, not
> solving the predicated store problem (so presumably, you have to be
> careful when doing code motion). There's the program dependence graph
> [4], but they seem to indicate control dependence in a different way.
> This makes sense, since having to construct full-blown SSI for every
> pointer dereference would probably murder your performance and memory
> usage for most programs designed to run on the CPU. SSBO's are a much
> rarer thing, and I would presume they aren't used as often in shaders,
> so it's probably more practical for us to do it, but it doesn't
> exactly look like this is a well-trodden path.
>
> [1] http://llvm.org/docs/doxygen/html/LoadCombine_8cpp_source.html
> [2] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-801.pdf
> [3] "Effective Representation of Aliases and Indirect Memory
> Operations in SSA Form" by Chow et. al.
> [4] https://www.cs.utexas.edu/~pingali/CS395T/2009fa/papers/ferrante87.pdf
>
>>
>>> 2. Add a special "load-combining" pass that does some dataflow
>>> analysis or similar (or, for now, only looks at things within a single
>>> block).
>>>
>>> The advantage of #1 is that we get to use existing NIR passes, like
>>> CSE, DCE, and GCM "for free" on SSBO loads and stores, without having
>>> to do the equivalent thing using dataflow analysis. Also, doing store
>>> forwarding (i.e. replacing the result of an SSBO load with the value
>>> corresponding to a store, if we can figure out which store affects it)
>>> is going to much easier. However, #1 is going to be much more of a
>>> research project. I've thought about how we could do it, but I'm still
>>> not sure how it could be done feasibly and still be correct.
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20151023/5e295f55/attachment-0001.sig>